ABSTRACT


This report demonstrates the applicability of classical statistical
techniques to problems involving compression and classification of
multivariate data. The theoretical foundations of two such techniques,
intrinsic analysis and discriminant analysis, are treated in detail.
Efficient digital computer implementation is discussed, including the
combined application of intrinsic and discriminant analysis and a new
algorithm for computing approximate intrinsic bases for very large
problems. Experimental results are presented on the application of these
techniques as feature extractors in a signal classification problem. Also
included is a description of the interactive graphics-oriented system
software which has been developed to facilitate the application of these
techniques.
INTRODUCTION

The research effort reported herein has been directed toward the
implementation of an interactive, graphics-oriented computer system for the
representation, analysis and classification of multivariate data. Work has
proceeded in two parallel areas: development of the appropriate analytical
techniques for, and design of, a software system tailored to the requirements
of such a facility. The ultimate purpose of the system is to provide a tool
for the systems engineer or scientist dealing with large scale problems
requiring reduction and/or classification of high dimensional data, by
supplying a means of evaluating the effectiveness (or lack of effectiveness)
of proposed approaches to his specific problems. The system can also be used
to investigate the interrelationships among standard analytical techniques
and to develop new data analysis methods.

The typical application deals with sets of data whose members are
measurement vectors, for example, simultaneous outputs of a bank of sensors
or discrete time samples of a continuous function. It is easy to display
these vectors component by component, but this reveals little information
about the overall statistical properties of the random processes from which
they have been sampled. Therefore it is desirable to find two-dimensional
representations for the measurement space in which the members of entire
data sets appear as projected points. If the coordinates of this
representation are chosen judiciously, the resulting projection may yield
valuable insight into the statistical relationships among the data elements.

One of the major goals of this effort has been the development of
analytical methods for selecting such coordinates. These include means for
efficient representation of data sets with highly redundant measurements
(intrinsic analysis, Section II) and for viewing the separability of several
distinct data sets (discriminant analysis, Section III). Classification
problems require automatic pattern recognition algorithms. The pattern
classification problem and several specific methods are discussed in
Section IV.
The primary consideration in the selection of all of these methods is
that they have a sound mathematical formulation. This is to assure that the
resulting system is sufficiently general to be applicable to a wide range of
problems, and to facilitate the analysis of its performance. We, therefore,
have avoided methods requiring interactive human supervision in their
execution. Human interaction is generally restricted to the selection of the
sequence of processes, with their parameters, to be applied to the data and
control of the interactive display programs.

By applying sequences of elementary processes and observing the
results at each stage, the user may develop compound procedures appropriate to
his application. To emphasize the value of this building block approach,
this report stresses the interrelationships among the analytical techniques.
An example involving the application of information compression methods as
feature extractors to improve the performance of a pattern classification
algorithm is presented in Section V.

The computer software which has been designed to aid in the
implementation of this system is described in Section VI. This includes a
disk file system, an on-line monitor, an interactive vector projection
display program and an extended macro assembly language, in addition to the
mathematical routines.

The mathematical developments which follow are descriptive enough
for the general reader with limited mathematical background to understand
the underlying concepts, although familiarity with probability theory and
matrix algebra is desirable. Extensive use is made of the concept of a
random vector, which is a vector whose components are random variables. No
notational convention is used to distinguish scalars, vectors and matrices;
the distinctions should be clear from context. Vectors are always column
vectors; transposes of vectors are always row vectors. Specific notations
are defined as needed in the text.
INFORMATION COMPRESSION

A major problem in many data analysis and classification problems, as
well as data transmission applications, is the high dimensionality of the
data. Data vectors arising from sampled continuous signals or their Fourier
transforms, for example, typically contain hundreds or thousands of sample
points. All but the most straightforward data analysis and pattern
recognition techniques tend to bog down in numerical computations or become
ineffective when dealing with such large problems. One way to alleviate such
problems is to find a more compact representation for the data which
preserves as much as possible of its original information content. We shall
refer to this process as information compression or dimension reduction.

The  canonical  data  representation  to  be  developed  here  is  similar  to 
Fourier  analysis  in  that  its  components  are  inner  products  of  the  data  with 
members  of  an  orthogonal  function  set.  However,  the  form  of  the  orthogonal 
functions  is  not  restricted  to  sinusoids.  Thus  it  may  be  thought  of  as 
generalized  harmonic  analysis.  It  also  has  the  desirable  property  that  its 
components  are  uncorrelated. 

Essentially the same technique has been developed by many authors in
several disciplines. In multivariate statistics (Wilks [1] and Anderson [2])
it is referred to as principal components analysis. It was applied by Kramer
and Mathews [3] to speech bandwidth compression by encoding the output of a
channel vocoder. The term intrinsic analysis is due to Young and Huggins [4]
and is also used by Walter [5] and Colomb [6]. In communication theory
(Davenport and Root [7]) and probability theory (Loeve [8]) it appears as
Loeve-Karhunen analysis. The technique is equally applicable to continuous
(real or complex-valued) functions or vectors. For our purposes, the vector
formulation is more convenient, and will be developed here. The extension to
continuous functions is routine (see Colomb [6] or Watanabe [9]).

The approximate data representations obtained through intrinsic
analysis are optimal in the least mean square error sense. It develops that
they are also optimal in the sense of minimizing an entropy function defined
on the coefficients. This is shown by Watanabe [9], who relates these two
properties in the context of a pattern clustering and recognition problem. An
instructive proof of the error minimization property is given by Anderson [2].
Mostly for heuristic purposes, we offer here a simplified version which
relies more heavily on algebraic eigenvalue theory. We will also relate
Watanabe's results on entropy minimization.

1. Intrinsic Analysis

Let X be a random vector in the n-dimensional real vector space V
with probability density function p(X). If F(X) is a function of X, the
expectation of F(X) is defined by

\[ E\,F(X) = \int_V F(X)\, p(X)\, dX \]

The mean vector u of X is

\[ u = E\,X \]

and the autocorrelation matrix A of X is

\[ A = E\,X X' \]

where prime denotes transpose. A is symmetric (A(i,j) = A(j,i)) and the
element A(i,j) = E X(i) X(j) is the correlation of the i-th and j-th elements
of X. The euclidean norm, or length, of a vector v in V is

\[ \|v\| = (v'v)^{1/2} = \Big( \sum_{j=1}^{n} v(j)^2 \Big)^{1/2} \]

We define the energy of X as the expected value of the squared norm of X,
namely

\[ E(X) = E\,\|X\|^2 = E \sum_{j=1}^{n} X(j)^2 \]

Note that the energy of X is equal to the trace of A:

\[ E(X) = \sum_{j=1}^{n} E\,X(j)^2 = \sum_{j=1}^{n} A(j,j) = \operatorname{tr} A \]
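
As a small numerical check of this identity, the following sketch replaces the expectations by averages over a made-up sample. The array `x` and its values are illustrative assumptions and are not data from this report.

```python
import numpy as np

# Hypothetical sample: m = 4 vectors of dimension n = 3, one vector per row.
x = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
m = x.shape[0]

A = x.T @ x / m                           # sample version of A = E X X'
energy = np.mean(np.sum(x**2, axis=1))    # sample version of E ||X||^2

print(energy, np.trace(A))                # the two numbers agree
```

The agreement holds for any sample, since the diagonal of `A` collects the mean squares of the individual components.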

Our approach to the dimension reduction problem is to find, for any
k < n, a k-dimensional subspace $V_k$ of V which maximizes the energy of
the approximation of X by projection onto $V_k$. It is sufficient to find
k orthonormal vectors $\{\phi_i,\ i = 1, \ldots, k\}$ which span $V_k$. By
orthonormal, we mean that

\[ \phi_i' \phi_j = 0 \quad \text{for } i \ne j \]

\[ \phi_i' \phi_i = 1 \quad \text{for } i = 1, \ldots, k \]

The coordinates of X in $V_k$ are the projections of X on the $\phi_i$,

\[ c_i = \phi_i' X \]

and the approximation of X in the standard basis of V is

\[ X_k = \sum_{i=1}^{k} c_i \phi_i \]

which permits reconstruction of the n-space representation of the
approximation.

We define the relevance of $\phi_i$ in representing X as the mean
squared projection of X on $\phi_i$:

\[ \rho_i = E\,(\phi_i' X)^2 = E\,(c_i^2) \]

Let $\Phi_k$ be the matrix whose columns are the $\phi_i$. Then the projection
from V onto the $\phi_i$ basis of $V_k$ is given by the standard change of
basis transformation

\[ c = \Phi_k' X \]

where $c = (c_1, \ldots, c_k)'$ is the vector of coordinates, and the energy
of the desired optimal approximation is

\[ E(X_k) = E\,\|\Phi_k' X\|^2 = E \sum_{i=1}^{k} (\phi_i' X)^2 = \sum_{i=1}^{k} \rho_i \]

So maximizing the energy of approximation in a subspace is equivalent to
maximizing the sum of the relevances of the basis vectors.

When k = 1, with $\phi_1$ denoted by $\phi$, the problem is to maximize

\[ \rho_1 = E\,(\phi' X)(X' \phi) = \phi'\, E(X X')\, \phi = \phi' A \phi \]

which is the quadratic form associated with A, the autocorrelation matrix
of X. So the normal vector $\phi$ which maximizes $\rho_1$ must maximize the
quadratic form of A subject to the normality constraint $\phi'\phi - 1 = 0$.
Using the Lagrange multiplier $\lambda$, define

\[ \alpha = \phi' A \phi - \lambda\,(\phi'\phi - 1) \]

The vector of partial derivatives of $\alpha$ w.r.t. the components of $\phi$
is, according to Anderson [2], p. 3**7,

\[ \frac{\partial \alpha}{\partial \phi} = 2A\phi - 2\lambda\phi \]

If $\phi$ maximizes $\phi' A \phi$ with $\phi'\phi - 1 = 0$,
$\partial\alpha/\partial\phi$ must be zero, which yields

\[ A\phi = \lambda\phi \]

the algebraic eigenvalue equation. Since A is an autocorrelation matrix,
it is positive semidefinite, that is, all of its eigenvalues are
non-negative. Premultiplying by $\phi'$, we have
$\phi' A \phi = \phi' \lambda \phi = \lambda$, which implies that
$\rho_1 = \lambda$. So the maximum possible relevance to X of a normal basis
vector is $\lambda_1$, the largest eigenvalue of A, and that vector is the
eigenvector $\phi_1$ of A corresponding to $\lambda_1$.

We extend this result to arbitrary k by induction. Having determined
$\phi_1, \ldots, \phi_{i-1}$ and $E(X_{i-1})$, we seek the optimal orthonormal
basis for $V_i$ and its energy $E(X_i)$. This basis must include
$\phi_1, \ldots, \phi_{i-1}$, for otherwise $E(X_i)$ determined by the new
basis could be improved by substituting the missing $\phi$'s for members of
the new basis orthogonal to the $\phi$'s contained in $V_{i-1}$, since the
old basis is optimal for k = i-1. So the problem is to maximize the
additional energy $E(X_i) - E(X_{i-1})$ obtained by adding one more basis
vector. As before, this is $\rho_i = \phi_i' A \phi_i$, which leads to the
eigenvalue equation. Therefore, the maximum additional energy is obtained by
projection on the eigenvector $\phi_i$ of A corresponding to the largest
remaining eigenvalue $\lambda_i$ of A, and

\[ E(X_i) - E(X_{i-1}) = \rho_i = \lambda_i \]

We conclude that the k-dimensional subspace $V_k$ of V which contains
the greatest fraction of the energy of X is spanned by the eigenvectors
$\phi_1, \ldots, \phi_k$ of the autocorrelation matrix A of X corresponding
to the k largest eigenvalues $\lambda_1, \ldots, \lambda_k$. The $\phi_j$ are
called the intrinsic basis vectors of X. The energy of the approximation in
$V_k$ is

\[ E(X_k) = \sum_{j=1}^{k} \rho_j = \sum_{j=1}^{k} \lambda_j \]

where $\rho_j$ is the relevance of $\phi_j$ to X. A convenient measure of the
performance of the intrinsic analysis is the fraction of the energy retained,
which is

\[ \frac{E(X_k)}{E(X)} = \frac{\sum_{i=1}^{k} \lambda_i}{\operatorname{tr} A} \]
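
The conclusion above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `intrinsic_basis` and `energy_fraction` and the 3 x 3 matrix are assumptions introduced here, and `numpy.linalg.eigh` stands in for whatever eigensystem routine the original system used.

```python
import numpy as np

def intrinsic_basis(A, k):
    """Return the k largest eigenvalues of the symmetric matrix A and the
    corresponding orthonormal eigenvectors (as columns)."""
    lam, phi = np.linalg.eigh(A)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:k]        # indices of the k largest
    return lam[order], phi[:, order]

def energy_fraction(A, k):
    """Fraction of the energy tr A retained by the first k basis vectors."""
    lam, _ = intrinsic_basis(A, k)
    return lam.sum() / np.trace(A)

# Hypothetical autocorrelation matrix (symmetric, positive definite).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])

lam, phi = intrinsic_basis(A, 2)
print(lam)                      # relevances of the two basis vectors
print(energy_fraction(A, 2))    # fraction of energy retained for k = 2
```

For this matrix the two largest eigenvalues belong to the upper 2 x 2 block and sum to 7, so the retained fraction for k = 2 is 7/8.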


The above treatment is valid if $\lambda_i \ne \lambda_j$ whenever
$i \ne j$ and $\lambda_i \ne 0$, i = 1, ..., n. Equal eigenvalues determine
an eigensubspace of the corresponding dimension and within this subspace,
the orthonormal intrinsic basis vectors may be chosen arbitrarily.
Similarly, zero eigenvalues determine a subspace orthogonal to the rest of
the eigenspace. For a detailed treatment of these special cases, see
Anderson [2].

Maximization of the energy in the intrinsic basis approximation is
equivalent to minimization of the mean square error of the approximation. To
see this, we observe that when k = n, the intrinsic basis approximation is
exact, so $\|\Phi_n' v\| = \|v\|$. The mean square error (the average squared
norm of the error vector) is then

\[ \epsilon(X_k) = E\,\|X_n - X_k\|^2 = E \sum_{j=k+1}^{n} (\phi_j' X)^2 \]

\[ = E \Big( \sum_{j=1}^{n} (\phi_j' X)^2 - \sum_{j=1}^{k} (\phi_j' X)^2 \Big) \]

\[ = E\,\|\Phi_n' X\|^2 - E\,\|\Phi_k' X\|^2 \]

\[ = E(X) - E(X_k) \]

Expressing the approximation energy in terms of the eigenvalues, the error
becomes

\[ \epsilon(X_k) = E(X) - \sum_{j=1}^{k} \lambda_j = \operatorname{tr} A - \sum_{j=1}^{k} \lambda_j \]

and as a fraction of total energy, the error is

\[ \frac{\epsilon(X_k)}{E(X)} = \frac{\operatorname{tr} A - \sum_{j=1}^{k} \lambda_j}{\operatorname{tr} A} \]
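
The error formula can be checked numerically. The sketch below uses synthetic data from a seeded random generator (an assumption for illustration), replaces expectations by sample averages, and compares the measured mean square reconstruction error with tr A minus the retained eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated sample: 200 vectors of dimension 5, one per row.
x = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
A = x.T @ x / len(x)                     # sample autocorrelation matrix

lam, phi = np.linalg.eigh(A)
lam, phi = lam[::-1], phi[:, ::-1]       # reorder to descending eigenvalues

k = 2
phi_k = phi[:, :k]                       # first k intrinsic basis vectors
x_k = x @ phi_k @ phi_k.T                # row i holds Phi_k Phi_k' x_i
mse = np.mean(np.sum((x - x_k)**2, axis=1))
predicted = np.trace(A) - lam[:k].sum()

print(mse, predicted)                    # agree up to rounding error
```

Because the same sample defines both A and the error average, the agreement is exact up to floating point rounding rather than merely approximate.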


Intrinsic analysis corresponds very closely to a method of multivariate
statistics called principal components analysis. The relationship between
these two techniques is as follows: if the mean u of X is nonzero,
the first intrinsic basis vector $\phi_1$ may tend to resemble it. If X is
symmetrically distributed about $u \ne 0$, then $\phi_1$ is a scalar multiple
of u. On the other hand, it is noted by Colomb [6], p. 13, that there exist
distributions with zero mean having the same covariance matrix as a
distribution with an arbitrary mean u. In practical situations, however, we
may expect that if u is large, $\phi_1$ will tend to have high correlations
with most of the elements of X and, therefore, have a strong similarity to u.
In problems where interest is focused on covariances among the elements of X
rather than on cross correlations, it is desirable to eliminate this effect
of the mean vector by performing the intrinsic analysis on the variable X-u,
which has zero mean. This is called principal components analysis. The
autocorrelation matrix $\Sigma$ of X-u is called the covariance matrix of X.
A diagonal element of this matrix, $E\,(X(i) - u(i))^2$, is the variance of
X(i); a non-diagonal element, $E\,(X(i) - u(i))(X(j) - u(j))$, is the
covariance of X(i) and X(j). With this change, references to energy in the
development for intrinsic analysis may be read as variance, which is the
energy of X-u. The eigenvectors of $\Sigma$ are called the principal
components of X because, properly ordered, they account for the "principal
components" of the variance of X about u.

At the level of machine calculations, these techniques are easily
interchangeable because the autocorrelation and mean of X determine its
covariance by the following relation

\[ \Sigma = E\,(X-u)(X-u)' \]

\[ = E\,(XX' - Xu' - uX' + uu') \]

\[ = E\,XX' - (E\,X)\,u' - u\,(E\,X') + uu' \]

\[ = A - uu' \]
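
A brief sketch of this relation with sample averages in place of expectations follows. The four sample vectors are made up for illustration.

```python
import numpy as np

# Hypothetical sample: four 2-dimensional vectors, one per row.
x = np.array([[2.0, 0.0],
              [4.0, 2.0],
              [6.0, 1.0],
              [8.0, 5.0]])
m = len(x)

u = x.mean(axis=0)                       # sample mean
A = x.T @ x / m                          # sample autocorrelation
centered = x - u
sigma_direct = centered.T @ centered / m # covariance from its definition
sigma_from_A = A - np.outer(u, u)        # covariance from A - u u'

print(np.allclose(sigma_direct, sigma_from_A))
```

In practice this means one pass over the data, accumulating the sum of the vectors and the sum of their outer products, suffices for either analysis.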

Watanabe [9] has shown that intrinsic analysis satisfies another
consideration in selecting a basis for information compression. This is that
the relevance measures $\rho_i$ should be highly concentrated on a few of the
basis vectors rather than spread out more evenly. If $\psi_1, \ldots, \psi_n$
is any orthonormal basis of V and we normalize X so that E(X) = 1, then
$\rho_i$ is a probability measure on $\{\psi_i\}$ with

\[ \rho_i \ge 0, \qquad \sum_{i=1}^{n} \rho_i = 1 \]

It is then possible to introduce the entropy function

\[ H(\Psi) = -\sum_{i=1}^{n} \rho_i \log \rho_i \]

which is a measure of the concentration of the $\rho$'s. Watanabe demonstrates
that the intrinsic basis minimizes the entropy function as well as the mean
square error.
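
To illustrate the entropy comparison, the following sketch evaluates the entropy of the normalized relevances for two orthonormal bases of a made-up 2 x 2 autocorrelation matrix: the intrinsic basis and the standard coordinate basis. The helper `relevance_entropy` is a name introduced here for illustration, and natural logarithms are assumed.

```python
import numpy as np

def relevance_entropy(A, basis):
    """Entropy of the relevances rho_i = psi_i' A psi_i of the columns of
    `basis`, after normalizing the total energy tr A to one."""
    rho = np.diag(basis.T @ A @ basis) / np.trace(A)
    rho = rho[rho > 0]                   # 0 log 0 is taken as 0
    return float(-np.sum(rho * np.log(rho)))

# Hypothetical autocorrelation matrix with eigenvalues 3 and 1.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

_, intrinsic = np.linalg.eigh(A)         # columns are the intrinsic basis
standard = np.eye(2)                     # coordinate axes

h_intrinsic = relevance_entropy(A, intrinsic)
h_standard = relevance_entropy(A, standard)
print(h_intrinsic, h_standard)
```

Here the coordinate basis spreads the energy evenly (relevances 1/2 and 1/2), while the intrinsic basis concentrates it (3/4 and 1/4), giving the smaller entropy.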

2. Estimates of the Intrinsic Basis

In practice, the mean u and autocorrelation A of the random vector X
are seldom available. It is then necessary to estimate them from a finite
sample $\{x_1, \ldots, x_m\}$ of X. The estimates are simple averages
involving the sample vectors

\[ \hat{u} = \frac{1}{m} \sum_{i=1}^{m} x_i \ ; \qquad \hat{A} = \frac{1}{m} \sum_{i=1}^{m} x_i x_i' \]

Estimates of the intrinsic basis vectors $\hat{\phi}_1, \ldots, \hat{\phi}_n$
and their relevances $\hat{\lambda}_1, \ldots, \hat{\lambda}_n$ are given by
the eigensystem of $\hat{A}$ (or of $\hat{\Sigma} = \hat{A} - \hat{u}\hat{u}'$
in the case of principal components analysis). Anderson [2], p. 279, shows
that if the distribution of X is multivariate normal, then this process
defines the maximum likelihood estimates of $\phi_1, \ldots, \phi_n$ and
$\lambda_1, \ldots, \lambda_n$.
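
The estimation procedure can be sketched as follows. The function name `estimate_intrinsic_basis`, its `remove_mean` flag, and the five sample vectors are assumptions made for this illustration.

```python
import numpy as np

def estimate_intrinsic_basis(x, remove_mean=False):
    """x holds one sample vector per row.  Returns the estimated mean, the
    estimated relevances (descending) and the estimated basis (columns).
    With remove_mean=True the covariance estimate is used instead of the
    autocorrelation estimate (principal components analysis)."""
    m = x.shape[0]
    u_hat = x.sum(axis=0) / m
    M = x.T @ x / m                          # estimated autocorrelation
    if remove_mean:
        M = M - np.outer(u_hat, u_hat)       # estimated covariance
    lam, phi = np.linalg.eigh(M)
    return u_hat, lam[::-1], phi[:, ::-1]

# Hypothetical sample with a mean that is large relative to the scatter.
x = np.array([[3.0, 1.0],
              [4.0, 2.0],
              [5.0, 3.0],
              [4.0, 2.5],
              [4.0, 1.5]])

u_hat, lam_A, phi_A = estimate_intrinsic_basis(x)
_, lam_S, phi_S = estimate_intrinsic_basis(x, remove_mean=True)
print(u_hat, lam_A, lam_S)
```

In this example the first intrinsic basis vector is nearly parallel to the estimated mean, which is the effect that the subtraction of the mean is meant to remove.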

In real problems, the distributions involved are usually not
multivariate normal and are frequently unknown. However, it is demonstrated
by K. Miller in Ref. 6, p. 7, that $\hat{A}$ (or $\hat{\Sigma}$) is the
minimum variance linear unbiased estimate of A (or $\Sigma$). Since the
eigenvalues and eigenvectors are highly nonlinear functions of A, it is
difficult to infer from this the variance of the errors in the estimated
eigensystems. Here we will merely mention some conclusions of a detailed
discussion of the problem by Colomb, pp. 18-24. First, the eigenvalues are
stable with respect to perturbations of $\hat{A}$ in that, under certain
conditions, their variances are approximately equal to the variances of the
diagonal elements of the error matrix $\hat{A} - A$. Second, the eigenvectors
are very stable if their eigenvalues are well separated "and as the
eigenvalues get close together . . . the stability decreases until, when the
separation of the values is of the order of the perturbation, no certain
information is attained about any individual eigenvector." In the specific
problems studied thus far, the largest eigenvalues are most separated while
eigenvalues belonging to low-energy subspaces tend to cluster together.
However, if the best eigenvectors collectively are used to represent X
(which is the typical case) the errors in the eigenvectors resulting from
confusion with other eigenvectors being used are not harmful. Thus, the
accuracy of the resulting representation may be better than that indicated
by consideration of the eigenvectors separately. Finally it is noted that
the ambiguity in $\hat{A}$ usually decreases in inverse proportion to the
square root of the number of samples.

Errors in the intrinsic basis can also result from measurement and
computational errors. For purposes of error analysis, these can be treated
as part of the ambiguity of $\hat{A}$. While measurement errors (sensor
additive noise and quantization noise) are unavoidable, they are likely to
be less significant than the statistical sampling error already mentioned.
In implementations using a general purpose digital computer, the
computational error may be reduced to insignificance by using an accurate
eigensystem algorithm such as Householder's Method (see Wilkinson [10],
pp. 290-335).

3.  Approximations  for  Large  Problems 

In many information compression problems, the dimension n of the
sample data is in the hundreds or thousands. Such high dimensions introduce
large costs in terms of both computation time and computer memory. The time
required for estimation of the autocorrelation matrix increases with $n^2$
and for calculation of its eigensystem roughly with $n^3$. More serious,
perhaps, are the random access memory requirements of these processes.
Unless the efficiency of the computations is drastically reduced, it is
necessary to retain the entire matrix in memory simultaneously. Since it is
symmetric, this means that at least n(n+1)/2 locations are required. Thus if
there are S available storage locations, the maximum dimension which can be
handled is less than $\sqrt{2S}$. For problems too large for existing
hardware, it is sometimes possible to obtain useful approximations of the
estimated intrinsic basis by the procedure described by Roper [11]. The
rest of this section is devoted to a summary of this method.
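
The storage bound can be made concrete with a short calculation. The helper `max_dimension` and the figure of 32,768 storage locations are assumptions chosen for illustration, not values taken from this report.

```python
import math

def max_dimension(S):
    """Largest n for which the n(n+1)/2 distinct elements of a symmetric
    n x n matrix fit in S storage locations."""
    return (math.isqrt(8 * S + 1) - 1) // 2

S = 32768                                 # hypothetical memory size in words
n_max = max_dimension(S)
print(n_max, n_max * (n_max + 1) // 2)    # 255 32640
print(math.sqrt(2 * S))                   # 256.0, the bound sqrt(2S)
```

With 32,768 locations the largest matrix that fits has dimension 255, just under the bound of 256 given by the square root of 2S.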

The idea behind the approximation is to break a large eigensystem
problem up into several small ones and use their solutions to reduce the
size of the problem; this is equivalent to a piecewise dimension reduction
by intrinsic analysis. We denote the n-dimensional symmetric matrix in
question (in this case, the estimated autocorrelation matrix) by M, and its
true eigensystem by $(\Lambda, \Psi)$. We partition M symmetrically into
$p^2$ submatrices $M_{ij}$, and find the eigensystems $(\Lambda_i, \Phi_i)$
of the p diagonal submatrices $M_{ii}$. Next we discard all but the $k_i$
most relevant eigenvectors from each $\Phi_i$
($\sum_{i=1}^{p} k_i = k < n$), and form the n x k matrix $\Phi$ with the
$\Phi_i$ as submatrices along the diagonal and zeros elsewhere. Note that
$\Phi$ inherits the orthonormality of the $\Phi_i$. Therefore, $\Phi$ can be
used to transform M into a k x k dimensional approximation

\[ \hat{M} = \Phi' M \Phi \]

whose eigensystem, given by $\hat{M}\Theta = \Theta\hat{\Lambda}$, satisfies

\[ \Theta' \hat{M} \Theta = \hat{\Lambda} \]

Combining these, we have

\[ \Theta' \Phi' M \Phi \Theta = \hat{\Lambda} \]

If we define

\[ \hat{\Psi} = \Phi \Theta \]

the above relation becomes

\[ \hat{\Psi}' M \hat{\Psi} = \hat{\Lambda} \]

and $(\hat{\Lambda}, \hat{\Psi})$ is the approximation of the first k members
of $(\Lambda, \Psi)$. It is easy to show that the columns of $\hat{\Psi}$ are
orthonormal (since $\hat{\Psi}'\hat{\Psi} = \Theta'\Phi'\Phi\Theta =
\Theta'\Theta = I$), so if M is an estimated autocorrelation matrix, the
above expression implies that the relevances of the approximate eigenvectors
(to the energy of the sample vectors) are equal to the corresponding
approximate eigenvalues. This fact can help determine the value of such
approximations in dimension reduction problems, but unfortunately it cannot
tell us how far we are from the optimal representation. To minimize this
deviation, it is advisable to make k as large as possible even though fewer
components may be used in the final representation.
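
The piecewise procedure summarized above might be sketched as follows. This is a minimal illustration under stated assumptions, not Roper's implementation: the function names, the contiguous block layout given by `block_sizes`, and the random test matrix are all introduced here.

```python
import numpy as np

def top_eigvecs(M, k):
    """k largest eigenvalues of symmetric M and their eigenvectors."""
    lam, vec = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1][:k]
    return lam[order], vec[:, order]

def piecewise_eigensystem(M, block_sizes, keep):
    """Approximate leading eigensystem of M from its diagonal blocks.
    block_sizes: sizes of the p contiguous diagonal blocks (sum = n).
    keep:        number k_i of eigenvectors retained from each block."""
    n, k = M.shape[0], sum(keep)
    Phi = np.zeros((n, k))                   # block diagonal n x k matrix
    row = col = 0
    for size, k_i in zip(block_sizes, keep):
        _, Phi_i = top_eigvecs(M[row:row + size, row:row + size], k_i)
        Phi[row:row + size, col:col + k_i] = Phi_i
        row += size
        col += k_i
    M_hat = Phi.T @ M @ Phi                  # reduced k x k problem
    lam_hat, Theta = top_eigvecs(M_hat, k)
    return lam_hat, Phi @ Theta              # approximate (Lambda, Psi)

# Hypothetical estimated autocorrelation matrix of dimension 6.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 6))
M = x.T @ x / 50

lam_hat, Psi_hat = piecewise_eigensystem(M, block_sizes=[3, 3], keep=[2, 2])
print(lam_hat)
print(np.diag(Psi_hat.T @ M @ Psi_hat))      # relevances equal lam_hat
```

When every block keeps all of its eigenvectors (k = n) the result coincides with the exact eigensystem, which is one way to test such a routine.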
One further refinement is possible (and necessary if M does not fit
in memory). In addition to speeding up the eigensystem calculations, it
allows us to reduce the time required for computation of M. This is done
by computing directly only the $M_{ii}$. The $\hat{M}_{ij}$ are then computed
from data vectors in the reduced $\Phi$ basis. The savings realized are
substantial, especially when the number of sample vectors is large. Details
on implementation and further discussion of the reductions in computation
time may be found in Roper [11].
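
A sketch of this refinement, under the same illustrative assumptions as before (contiguous blocks, synthetic data, names introduced here): only the diagonal blocks of M are formed from the raw data, each sample is reduced block by block, and the k x k matrix is then accumulated from the reduced vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 6
x = rng.normal(size=(m, n))              # hypothetical sample, one per row
blocks = [slice(0, 3), slice(3, 6)]      # symmetric partition, p = 2
keep = 2                                 # k_i eigenvectors kept per block

Phi_blocks, reduced = [], []
for b in blocks:
    xb = x[:, b]
    M_ii = xb.T @ xb / m                 # only a diagonal block is formed
    lam, vec = np.linalg.eigh(M_ii)
    Phi_i = vec[:, ::-1][:, :keep]       # most relevant eigenvectors first
    Phi_blocks.append(Phi_i)
    reduced.append(xb @ Phi_i)           # block coordinates of each sample

y = np.hstack(reduced)                   # samples in the reduced Phi basis
M_hat = y.T @ y / m                      # k x k matrix from reduced data
print(M_hat.shape)
```

The full n x n matrix never appears, and the accumulation of outer products is carried out on vectors of length k rather than n.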


Section  III 


DISCRIMINATION  AMONG  SAMPLE  CLASSES 

In  the  previous  section  we  were  concerned  with  samples  of  a  single 
random  vector.  Here  we  will  consider  several  distinct  classes  of  samples 
drawn  from  random  vectors  with  different  multivariate  distribution  functions. 

The methods described here are motivated by the requirements of a
graphics-oriented data analysis facility. In the typical data reduction
and classification problem, the scientist or systems engineer is concerned
with whether the available measurements (components of the sample vectors)
are adequate to determine class membership, and with which class the
measurements are most useful. For projection on a computer display, a
two-dimensional subspace of the measurement space must be selected.
Coordinate projections are not very useful since they ignore the information
in all but two of the measurements. Intrinsic basis representations embody
information from all or most of the measurements, but make no use of class
membership information. The alternative suggested here is a low-dimensional
representation for the sample vectors which tends to cluster together samples
within each individual class while emphasizing the variations among all the
classes. Such a projection gives the analyst the optimal two-dimensional
linear projection of his classification problem which can yield insights
into its statistical characteristics, for example, the degree of linear
separability of the sample classes.

One further motivation for using such representations lies in the
implementation of automatic pattern classification algorithms. The
performance of most algorithms is improved if the pattern vectors are first
subjected to a transformation which clusters samples within the same class.

This section begins with a derivation of the discriminant analysis
technique. There follows a discussion of estimation and computation of the
discriminants. A serious computational problem arises when the number of
sample vectors is small relative to their dimension. A way of circumventing
this problem by prior application of intrinsic analysis is suggested.
1. Discriminant Analysis

Discriminant analysis, like principal components analysis, is a
technique of multivariate analysis of variance. It is developed by
Hotelling [12] and Wilks [1]. A detailed exposition, with applications to
the data display problem, is offered by Colomb [13]. As in Section II, our
exposition will begin by treating random vectors. Then, with the theoretical
groundwork established, it will be expanded to include finite samples of
them.

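Let $X_1, \ldots, X_r$ be n-dimensional random vectors with mean vectors
$u_1, \ldots, u_r$ and covariance matrices $\Sigma_1, \ldots, \Sigma_r$. Each
random vector $X_i$ is identified with a sample class. The within-class
covariance matrix, which describes covariance about class means, is obtained
by averaging the covariance matrices of the classes

\[ W = \frac{1}{r} \sum_{i=1}^{r} \Sigma_i \]

the grand mean is the average of the class means

\[ u = \frac{1}{r} \sum_{i=1}^{r} u_i \]

and the among-classes covariance matrix, which describes covariance of the
class means about the grand mean, is

\[ A = \frac{1}{r} \sum_{i=1}^{r} u_i u_i' - u u' \]

(In this section A denotes the among-classes covariance matrix rather than
the autocorrelation matrix of Section II.)
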

Principal components analysis produces vectors, or directions, in which the
covariance of a random vector is maximized. Discriminant analysis finds
vectors which maximize the among groups covariance while minimizing the
within groups covariance. From the discussion of intrinsic analysis, we
recall that the relevance of a vector $\phi$ to the covariance described by
$\Sigma$ is $\phi' \Sigma \phi$. Thus we want to find a discriminant vector d
which maximizes

\[ \frac{\sigma_A^2}{\sigma_W^2} = \frac{d' A d}{d' W d} \]

the ratio of the relevances, or variances in the direction of d. This can
be accomplished by maximizing $\sigma_A^2$ with $\sigma_W^2$ held constant.
Again using $\lambda$ to denote a Lagrange multiplier, a necessary condition
for this maximization is

\[ \frac{\partial}{\partial d} \big[\, d' A d - \lambda\,(d' W d - \text{constant}) \,\big] = 0 \]

\[ 2Ad - \lambda\, 2Wd = 0 \]

or

\[ Ad = \lambda W d \]

This is the generalized algebraic eigenvalue equation. The number of
distinct, non-trivial solutions (discriminant vectors) is equal to the rank
of A, which is at most r-1 (the number of classes minus one). The subspace
spanned by $d_1, \ldots, d_{r-1}$ is called the discriminant space. (Since
the discriminant vectors are not necessarily orthonormal, it is desirable,
in practice, to orthogonalize them to obtain an orthonormal basis for the
discriminant space.) Finally, we note that since we can pre-multiply the
above by d' to obtain $d' A d = d' \lambda W d$, we have

\[ \lambda = \frac{d' A d}{d' W d} = \frac{\sigma_A^2}{\sigma_W^2} \]

Therefore, the eigenvalues indicate the ratio of the among-class to the
within-class variances of projections on the corresponding discriminants.
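
One common way to solve the generalized equation when W is positive definite is to factor W by Cholesky decomposition and solve an ordinary symmetric eigenproblem. The sketch below follows that route as an illustration; it is not claimed to be the procedure used in the original system, and the function name `discriminants`, the class means, and the matrix W are made-up assumptions.

```python
import numpy as np

def discriminants(A, W):
    """Solve A d = lambda W d for symmetric A and positive definite W.
    Returns eigenvalues in descending order and discriminants as columns,
    scaled so that d' W d = 1."""
    L = np.linalg.cholesky(W)                # W = L L'
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T                    # C y = lambda y, with y = L' d
    lam, y = np.linalg.eigh(C)
    lam, y = lam[::-1], y[:, ::-1]
    return lam, Linv.T @ y                   # d = (L')^{-1} y

# Hypothetical problem: r = 3 classes in n = 3 dimensions.
means = np.array([[0.0, 0.0, 0.0],
                  [2.0, 0.0, 1.0],
                  [0.0, 2.0, 1.0]])
u = means.mean(axis=0)
A = means.T @ means / len(means) - np.outer(u, u)   # among-classes
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.5]])                     # within-class

lam, d = discriminants(A, W)
d1 = d[:, 0]
print(lam)
print((d1 @ A @ d1) / (d1 @ W @ d1))         # equals the largest eigenvalue
```

With three classes only two eigenvalues are non-zero, in agreement with the rank of A; the third is zero to within rounding error.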

2. Discriminant Computations

One of the advantages of intrinsic analysis is its ability to reduce
a high dimensional set of data with redundant measurements to a lower
dimensional representation, provided that the number of samples available
is sufficient to provide a reasonable estimate of the intrinsic basis. The
situation is quite different with discriminant analysis. As we shall see,
the discriminant computations are impossible if the number of samples is
less than their dimension.


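Suppose the random vectors $X_i$, i = 1, ..., r, are represented by sets of
sample vectors $\{x_{ij},\ j = 1, \ldots, m_i\}$, where

\[ m = \sum_{i=1}^{r} m_i \]

The means and covariance matrices of the sample classes are estimated by

\[ u_i = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij} \ ; \qquad W_i = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij} x_{ij}' - u_i u_i' \]

The grand mean and among-groups covariance matrix are estimated by

\[ u = \frac{1}{m} \sum_{i=1}^{r} m_i u_i \ ; \qquad A = \frac{1}{m} \sum_{i=1}^{r} m_i u_i u_i' - u u' \]

and the within-groups covariance matrix by

\[ W = \frac{1}{m} \sum_{i=1}^{r} m_i W_i \]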


Estimates of the discriminants are the solutions of the generalized eigenvalue equation

$$Ad = \delta\, Wd$$

which may be reduced to the ordinary eigenvalue equation by a fast and accurate process given by Wilkinson, pp. 337-40. For details of the computations and a discussion of the numerical errors involved, the reader is referred to Colomb, pp. 18-20. The fact which concerns us here is that the reduction requires what amounts to an inversion of W. The rank of W is less than or equal to m-r. So if

$$m - r < n$$

where n is the dimension of the samples, W is singular and its inverse does not exist. Streeter and Raviv,¹⁴ p. 16, have experimentally evaluated three different means of avoiding this singularity problem, including the Moore-Penrose generalized inverse and the 'H' inverse of T. Harley. Both of these approaches introduce artificial constraints on the solution and are cumbersome to implement. The third approach, suggested by Streeter and Raviv, gave the best results. Their idea is to use intrinsic analysis to reduce the dimension of the samples so that the discriminants may be computed.

It has been found convenient to use principal components analysis rather than intrinsic analysis. The principal components are eigenvectors of the total covariance matrix of all the samples about the grand mean

$$T = \frac{1}{m}\sum_{i=1}^{r}\sum_{j=1}^{m_i} x_{ij}x_{ij}' - uu'$$

It can easily be shown that

$$T = A + W$$

that is, the total covariance equals the among-classes covariance plus the within-class covariance.
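
The decomposition of the total covariance can be checked numerically. The sketch below computes the among-groups, within-groups and total covariance estimates from their definitions and compares the sum of the first two with the third. The function names and the sample data are hypothetical, chosen only for illustration.

```python
def outer(u, v):
    """Outer product u v' as a list of rows."""
    return [[a * b for b in v] for a in u]


def add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def scale(c, P):
    return [[c * p for p in row] for row in P]


def mean(vectors):
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def second_moment(vectors):
    """(1/k) * sum of x x' over the given vectors."""
    n = len(vectors[0])
    S = [[0.0] * n for _ in range(n)]
    for x in vectors:
        S = add(S, outer(x, x))
    return scale(1.0 / len(vectors), S)


def scatter_matrices(classes):
    """Return (A, W, T) for a list of classes, each a list of sample vectors."""
    pooled = [x for c in classes for x in c]
    m = len(pooled)
    n = len(pooled[0])
    u = mean(pooled)                                  # grand mean
    uu = outer(u, u)
    A = scale(-1.0, uu)                               # accumulates (1/m) sum m_i u_i u_i' - u u'
    W = [[0.0] * n for _ in range(n)]                 # accumulates (1/m) sum m_i W_i
    for c in classes:
        ui = mean(c)
        Wi = add(second_moment(c), scale(-1.0, outer(ui, ui)))
        A = add(A, scale(len(c) / m, outer(ui, ui)))
        W = add(W, scale(len(c) / m, Wi))
    T = add(second_moment(pooled), scale(-1.0, uu))   # total covariance
    return A, W, T


# Invented sample data: two classes of unequal size in two dimensions.
classes = [
    [[0.0, 0.0], [2.0, 1.0], [1.0, 2.0]],
    [[5.0, 4.0], [6.0, 6.0]],
]
A, W, T = scatter_matrices(classes)
print(add(A, W))
print(T)
```

The two printed matrices agree to rounding error, since the terms in $u_i u_i'$ cancel when A and W are added.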

The samples are represented using the first k principal components as

$$\hat{x}_{ij} = U_k'\, x_{ij}$$

where

$$U_k'\, T\, U_k = \Lambda_k$$

The approximate covariance matrices $\hat{A}$ and $\hat{W}$ are then computed as before using the $\hat{x}_{ij}$, and the new discriminant vectors are the solutions of

$$\hat{A}d = \delta\, \hat{W}d$$

Even when it is not mathematically necessary, this approximation may be valuable in reducing the storage requirements for discriminant analysis, which are roughly twice those of principal components analysis, or because of the time savings resulting from the lowered dimension of the discriminant problem. Those savings are detailed by Colomb, pp. 20-23, who shows how to obtain further savings by taking advantage of the fact that

$$\hat{W} = \Lambda_k - \hat{A}$$

This follows because A + W = T, which, in the principal components basis, is $\Lambda_k$. This makes it possible to avoid direct calculation of $\hat{W}$ and even the transformation of the sample vectors, since $\hat{A}$ can be obtained by transforming only the class means.

The principal components approximation can also be useful in improving the accuracy of the discriminant vectors. Even when W is not singular, it may be very ill-conditioned. (The condition number of a symmetric matrix is the ratio of its largest and smallest eigenvalues.) An ill-conditioned W may introduce instabilities into the inversion process and errors in the resulting discriminant vectors and values. The discriminant equation may be rewritten

$$Ad = \delta\,(T - A)\,d$$

or

$$Ad = \frac{\delta}{1+\delta}\,Td$$

Solution of this form requires inversion of T, but in the principal components basis, $\hat{T} = \Lambda_k$ is diagonal. Thus, by an appropriate choice of the reduced dimension k, we may ensure that $\hat{T}$ has any required condition number. The only remaining question is that of the optimal choice of k. This problem is treated in great detail by Colomb.
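
One simple way to bound the condition number of the diagonal total covariance is to keep the largest k for which the ratio of the first to the k-th eigenvalue stays under a chosen limit. The sketch below illustrates only that idea; the function names, the threshold and the eigenvalue list are hypothetical, and this is not the optimal selection procedure the text attributes to Colomb. A second helper recovers the discriminant value from the eigenvalue of the rewritten equation.

```python
def choose_k(eigenvalues, max_condition):
    """Largest k such that lambda_1 / lambda_k <= max_condition.

    The eigenvalues of the total covariance are sorted in decreasing order;
    non-positive values are never retained.
    """
    lam = sorted(eigenvalues, reverse=True)
    k = 0
    for value in lam:
        if value <= 0 or lam[0] / value > max_condition:
            break
        k += 1
    return k


def delta_from_mu(mu):
    """If A d = mu T d with mu = delta / (1 + delta), then delta = mu / (1 - mu)."""
    return mu / (1.0 - mu)


# Invented spectrum: the last two components are nearly degenerate.
spectrum = [9.0, 4.0, 1.0, 0.01, 1e-9]
k = choose_k(spectrum, max_condition=100.0)
print(k)
print(delta_from_mu(0.8))
```

Here the fourth eigenvalue would give a condition number of 900, so only the first three components are kept.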


Section  IV 


METHODS  OF  DATA  CLASSIFICATION 

In many instances the ultimate goal of the information compression and discrimination techniques described above is the efficient implementation of a procedure for data classification or sorting. We are concerned here with classification of observations into one of several previously known categories. The pattern recognition problem has been formulated in several disciplines, including information theory, switching theory and control theory, and an informative survey of the field is offered by Nagy. We will review here the usual formulation in terms of statistical decision theory. Works of general interest in this area include Sebestyen,¹⁷ Highleyman and Nilsson. The treatment by Nilsson is most convenient and will be drawn on heavily here. We will discuss the implementation of pattern classifiers using discriminant functions, some optimal methods for normally distributed patterns, and the problem of estimating unknown multivariate probability density functions (called densities, for convenience) from finite training samples. We will review several approaches to this problem and discuss the most promising in some detail.

1.  Bayes  Optimal  Decision  Rules 

Statistical decision theory provides a means of specifying rules for pattern classification which are optimal in the sense of minimizing average losses due to incorrect classification. A good treatment of this approach is found in Robbins. We assume the existence of r pattern categories with a priori probabilities of occurrence p(i), for i = 1, ..., r. We must also specify a loss function $\lambda(i|j)$ which represents the loss resulting from assigning to category i a pattern which actually belongs to category j. Using the loss function, the conditional average loss of assigning pattern X to category i is defined by

$$L_i(X) = \sum_{j=1}^{r} \lambda(i|j)\, p(j|X)$$

where p(j|X) is the probability that, given X, its category is actually j. The average loss is, therefore, minimized by assigning each X to the category $i_0$ for which

$$L_{i_0}(X) \le L_i(X) \quad \text{for } i = 1, \ldots, r$$

Such a decision rule is called a Bayes strategy. Using Bayes' rule we may write

$$p(j|X) = \frac{p(X|j)\, p(j)}{p(X)}$$

where p(X|j) is the density function of category j evaluated at X. The conditional average loss then becomes

$$L_i(X) = \frac{1}{p(X)} \sum_{j=1}^{r} \lambda(i|j)\, p(X|j)\, p(j)$$


Since p(X) is independent of i, it need not be evaluated in minimizing $L_i(X)$. It remains to evaluate the probability density functions p(X|j) of each category at X. This is the central problem of pattern classification. In some instances the losses incurred by all misclassifications are equal. This situation is described by the symmetric loss function

$$\lambda(i|j) = 1 - \delta_{ij}$$

which is zero for correct classifications and one otherwise. The problem is then reduced to minimizing

$$L_i(X) = 1 - \frac{p(X|i)\, p(i)}{p(X)}$$

which, if all categories are equally likely, is equivalent to maximizing p(X|i). Such a rule is called a maximum likelihood decision rule.
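
The decision rule can be sketched in a few lines once the category densities at X are available. In the sketch below the likelihood values, priors and the asymmetric loss matrix are invented, and `bayes_decide` and `symmetric_loss` are hypothetical names. The common factor 1/p(X) is omitted because it does not affect which category minimizes the loss.

```python
def bayes_decide(likelihoods, priors, loss):
    """Return the category i minimizing sum_j loss[i][j] * p(X|j) * p(j).

    loss[i][j] is the loss of assigning to category i a pattern that
    actually belongs to category j.
    """
    r = len(priors)
    risks = [sum(loss[i][j] * likelihoods[j] * priors[j] for j in range(r))
             for i in range(r)]
    return min(range(r), key=lambda i: risks[i])


def symmetric_loss(r):
    """Zero loss for correct classification, unit loss otherwise."""
    return [[0.0 if i == j else 1.0 for j in range(r)] for i in range(r)]


likelihoods = [0.30, 0.50, 0.20]      # p(X|j) at the observed X (invented)
priors = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # equally likely categories

# Symmetric loss with equal priors reduces to maximum likelihood.
choice_ml = bayes_decide(likelihoods, priors, symmetric_loss(3))

# Asymmetric loss: missing a true category-0 pattern costs ten times as much.
costly = [[0.0, 1.0, 1.0],
          [10.0, 0.0, 1.0],
          [10.0, 1.0, 0.0]]
choice_costly = bayes_decide(likelihoods, priors, costly)
print(choice_ml, choice_costly)
```

The second call shows how an unequal loss function can move the decision away from the most likely category.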

The Bayes strategy may be explicitly implemented only if the p(X|j) are already known. In practical situations, this is not the case and the densities must be estimated from samples of the categories. Their functional form is sometimes known in advance (or more often assumed). In this case the problem is reduced to estimating the parameters of the density functions. This is called the "parametric" approach. For certain forms, notably the multivariate normal, convenient realizations of the optimal decision rules have been derived, several of which are described in this section. The weakness of this approach is that actual density functions do not usually conform to the assumed forms; this can result in badly suboptimal decision rules. An example of this is the case where the actual densities are multimodal.

The opposite "nonparametric" approach makes no assumptions about the density functions, except that they are reasonably smooth, and it approximates the entire function from sample patterns. These approximations become impractical as the dimension of the pattern vectors increases, because their domain is the pattern space and the number of points involved in a discrete approximation increases exponentially with the dimension. The problem is simplified if the components of the pattern vectors are statistically independent, for then

$$p(X|j) = \prod_{k=1}^{n} p(x_k|j)$$

where n is the dimension. Here we need only approximate n univariate densities for each category. If the components are not independent, but the categories have equal covariance matrices, then new, independent variables may be found by diagonalizing the covariance matrices (see Section II). However, there appears to be no general solution to the approximation problem. Most practical schemes approximate the densities indirectly through the use of discriminant functions (see below) which are equivalent, in terms of classification, to certain density function approximations. This problem, and various attempts to solve it, are discussed in greater detail under "Nonparametric Methods."

2.  Discriminant  Functions  and  Decision  Surfaces 

The theoretical foundation for the concept of discriminant functions and their role in pattern classification is summarized below. (The discriminant functions of decision theory should not be confused with the discriminant vectors of multivariate analysis of variance, treated in Section III.) Geometrically, a pattern classification rule is equivalent to a partition of the pattern space into disjoint regions corresponding to the categories. These regions are called decision regions and the surfaces separating them are called decision surfaces. The decision regions $R_1, \ldots, R_r$ of any r-category pattern classifier may be implicitly defined by a set of r discriminant functions $g_1(X), \ldots, g_r(X)$ which satisfy, for every X in $R_i$,

$$g_i(X) \ge g_j(X) \quad \text{for } i, j = 1, \ldots, r$$

Then the decision surface separating $R_i$ from $R_j$ is given by

$$g_i(X) - g_j(X) = 0$$

For example, the discriminant functions which determine the decision regions of the Bayes strategy are the negatives of the average loss functions: $g_i(X) = -L_i(X)$.

Discriminant functions are widely used in pattern recognition applications because of their relative ease of implementation. However, as the probability densities of the pattern classes become more complex, so do the corresponding optimal discriminant functions. Therefore, much work in pattern recognition theory has been devoted to finding suboptimal discriminants of simple forms which closely approximate the performance of optimal functions.

It is desirable to consider families of discriminant functions whose members are determined by a modest number of parameters (which must be stored in any implementation of the resulting classifier). There is a useful class of function families whose members depend linearly on their weights. Such discriminant functions are referred to by Nilsson as $\Phi$ functions and may be written

$$\Phi(X) = w_0 + w_1 f_1(X) + \cdots + w_M f_M(X)$$

where the $f_i(X)$, $i = 1, \ldots, M$, are linearly independent, real, single valued functions independent of the weights. The number of weights, M + 1, which determine the $\Phi$ function is called the number of degrees of freedom of the family. We will consider here only $\Phi$ function families whose members are polynomial functions of the components of X. The potential performance of polynomial discriminant functions increases with the degree of the polynomials, but so does the number of weights necessary to implement them. In fact, increasing the maximum degree of the polynomials from d-1 to d adds $\binom{d+n-1}{n-1}$ possible degrees of freedom when n > d. It is, therefore, necessary to limit the maximum degree d of the polynomials.

When d = 1, we have

$$f_i(X) = x_i \quad \text{for } i = 1, \ldots, n$$

and the $\Phi$ functions are linear in the components of X, with n+1 degrees of freedom. For d = 2, the $f_i$ are of the form

$$f_i(X) = x_{k_1}^{p_1} x_{k_2}^{p_2} \quad \text{for } k_1, k_2 = 1, \ldots, n;\;\; p_1, p_2 = 0, 1$$

resulting in quadric $\Phi$ functions with (n+1)(n+2)/2 degrees of freedom. In general, the $f_i$ are d-th degree polynomials

$$f_i(X) = x_{k_1}^{p_1} x_{k_2}^{p_2} \cdots x_{k_d}^{p_d} \quad \text{for } k_1, \ldots, k_d = 1, \ldots, n;\;\; p_1, \ldots, p_d = 0, 1$$

and the $\Phi$ functions are general d-th degree polynomials in the components of X with $\binom{n+d}{d}$ degrees of freedom. Note that, at least in theory, an arbitrary continuous discriminant function (for example, the likelihood functions p(X|i)) may be approximated to any desired degree of accuracy by appropriate choice of d. Later we will realize the usefulness of this fact.
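
The growth in the number of weights can be tabulated directly. The sketch below counts the monomials of exact degree d in n variables and the total number of weights of a general polynomial of degree at most d, using the standard binomial formulas. It assumes Python 3.8 or later for `math.comb`; the function names are hypothetical.

```python
from math import comb


def added_terms(n, d):
    """Number of distinct monomials of exact degree d in n variables."""
    return comb(d + n - 1, n - 1)


def degrees_of_freedom(n, d):
    """Number of weights of a general polynomial of degree <= d in n variables,
    counting the constant term."""
    return comb(n + d, d)


# Weights needed for a 20-dimensional pattern space at increasing degree.
for d in range(1, 5):
    print(d, added_terms(20, d), degrees_of_freedom(20, d))
```

For n = 20 the count rises from 21 weights at d = 1 to 231 at d = 2 and continues to grow rapidly, which is the practical reason for limiting d.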

It is instructive to consider the shape of the decision regions defined by polynomial discriminant functions. Since categories are assigned by finding the maximum $g_i(X)$, the decision surface separating regions $R_a$ and $R_b$ satisfies

$$g_a(X) - g_b(X) = 0$$

Since linear decision functions have the form $g(X) = w_0 + w_1 x_1 + \cdots + w_n x_n$, this is

$$(w_1^a - w_1^b)\,x_1 + \cdots + (w_n^a - w_n^b)\,x_n + (w_0^a - w_0^b) = 0$$

Thus each decision region of a linear pattern classifier is convex and is bounded by no more than r - 1 hyperplanes of dimension n - 1.

A quadric discriminant function has the form

$$g(X) = w_0 + \sum_{j=1}^{n} w_j x_j + \sum_{j=1}^{n}\sum_{k=j}^{n} q_{jk}\, x_j x_k$$

and is determined by its 1 + n + n(n+1)/2 weights. This can be expressed in matrix form as

$$g(X) = w_0 + W'X + X'QX$$

where W is the vector whose elements are the linear weights and Q is the symmetric matrix whose elements are the weights of the quadratic terms. So the quadric decision surface separating $R_a$ and $R_b$ satisfies

$$X'(Q_a - Q_b)X + (W_a - W_b)'X + (w_0^a - w_0^b) = 0$$

The shape of the quadric surface depends upon the quadratic form $X'(Q_a - Q_b)X$. If $(Q_a - Q_b)$ is positive (or negative) definite, the surface is called a hyperellipsoid; if $(Q_a - Q_b)$ is not positive (or negative) definite, the surface is called a hyperhyperboloid. In general, polynomial discriminant functions of degree d result in d-th order decision surfaces in the pattern space.

The higher the order of the optimal polynomial decision surface, the better its ability to separate pattern categories with complex distributions.

A set of categories which can be correctly identified by linear decision functions is called linearly separable. A family of $\Phi$ functions is determined by its component functions $f_i(X)$, $i = 1, \ldots, M$. These component functions can be used to define a transformation

$$F(X) = (f_1(X), \ldots, f_M(X))$$

from the pattern space into an M-dimensional space called the $\Phi$ space. Thus a decision surface in the pattern space implied by a given set of $\Phi$ functions has corresponding to it a hyperplane in the $\Phi$ space. If a set of categories is correctly identified by the $\Phi$ functions, it is linearly separable in the $\Phi$ space. In Section V we will see that the feature extraction problem may be viewed as the choice of an appropriate set of $\Phi$ functions.
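
A one-dimensional toy case may help to picture the transformation into the Φ space. In the sketch below one category lies on both sides of the other, so no single threshold on x separates them, but a linear function of the component functions x and x² does. The data, the weights and the function names are invented for illustration.

```python
def F(x):
    """Component functions f1(x) = x and f2(x) = x*x for a scalar pattern x."""
    return (x, x * x)


def phi(x, w0, w):
    """Phi function: linear in the weights, quadratic in the pattern."""
    return w0 + sum(wi * fi for wi, fi in zip(w, F(x)))


class_a = [-2.0, 2.0]          # outer points
class_b = [-0.5, 0.0, 0.5]     # inner points, surrounded by class_a

# Hyperplane f2 - 1 = 0 in the Phi space, i.e. x*x = 1 in the pattern space.
w0, w = -1.0, (0.0, 1.0)

print([phi(x, w0, w) for x in class_a])   # all positive
print([phi(x, w0, w) for x in class_b])   # all negative
```

The decision surface is a straight line in the two-dimensional Φ space and the pair of points x = ±1 in the original pattern space.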

3.  Discriminant  Functions  for  Normal  Populations 

If the probability density functions of the categories are multivariate normal, the optimal discriminant functions for a symmetric loss function may be derived explicitly (see Nilsson, p. 55). They are

$$g_i(X) = w_0 - \tfrac{1}{2}\,(X - u_i)'\,\Sigma_i^{-1}\,(X - u_i)$$

where $u_i$ and $\Sigma_i$ are the mean vector and covariance matrix of category i, and $w_0 = \ln p(i) - \tfrac{1}{2}\ln|\Sigma_i|$, for i = 1, ..., r. So the optimum discriminant functions for normal patterns are quadric.

In the case of equal covariance matrices these can be reduced to linear discriminant functions. Furthermore, if the covariance matrices are the identity and the a priori probabilities are equal, the (maximum likelihood) discriminants are given by

$$g_i(X) = X'u_i - \tfrac{1}{2}\,u_i'u_i \quad \text{for } i = 1, \ldots, r$$

Notice that equivalent classifications are obtained by minimizing the squared distance from X to $u_i$, which is

$$\mathrm{dist}^2(X, u_i) = (X - u_i)'(X - u_i) = X'X - 2X'u_i + u_i'u_i$$

because X'X is constant over i, and so may be eliminated. These linear discriminants are widely used when little is known about the distributions of the patterns because they satisfy the intuitive notion that unknown patterns should be assigned to categories (represented by the means) to which they are close. Due to these considerations, we will use the minimum distance criterion to demonstrate empirically the value of intrinsic analysis and discriminant analysis for pattern classification in an experiment described in Section V.
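
The equivalence between the linear discriminants and the minimum distance criterion is easy to confirm numerically. The sketch below classifies a few query patterns both ways, by maximizing the linear discriminant and by minimizing the squared Euclidean distance to the category means. The means, the queries and the function names are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_discriminant(x, u):
    """g(X) = X'u - (1/2) u'u for a category with mean u."""
    return dot(x, u) - 0.5 * dot(u, u)


def classify_linear(x, means):
    """Category whose linear discriminant is largest."""
    return max(range(len(means)), key=lambda i: linear_discriminant(x, means[i]))


def classify_nearest_mean(x, means):
    """Category whose mean is closest to x in squared Euclidean distance."""
    def dist2(u):
        return sum((a - b) ** 2 for a, b in zip(x, u))
    return min(range(len(means)), key=lambda i: dist2(means[i]))


means = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
queries = [[1.0, 0.5], [3.0, 1.0], [0.5, 2.5], [2.1, 0.0]]

labels_linear = [classify_linear(x, means) for x in queries]
labels_distance = [classify_nearest_mean(x, means) for x in queries]
print(labels_linear, labels_distance)
```

Only r inner products and r stored constants are needed per decision, which is part of the practical appeal of this classifier.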
4. Nonparametric Approaches

We now return to the problem of approximating arbitrary multivariate density functions from training samples. As we have seen, if the components of the patterns are statistically independent, these can be written as products of univariate densities. But the independence assumption is not usually justified, so we must, in effect, estimate the joint probabilities of the components of each possible pattern vector. For high dimensional pattern spaces this is very impractical and it is necessary to find some means of representing the densities indirectly. A productive technique for binary patterns is reported by Chow.²¹ Here the pattern space B comprises the $2^n$ vertices of an n-dimensional cube; the density functions are expanded as linear combinations of Walsh-Rademacher functions, which form a complete orthonormal basis for the space of real valued functions on B.

For continuous patterns the density function space becomes infinite dimensional. Various formal expansions for the continuous case have been proposed, for example, using Laguerre polynomials (Krishnomoorthy), but they are quite impractical. Kanal, pp. 4-20, reviews the problem of constructing orthonormal expansions and concludes: "In the multivariate case we are really faced with the curse of dimensionality and the prospect of constructing practical systems for adaptively approximating likelihood functions based on orthogonal expansions seems dim."

Another alternative is to give up direct estimation of densities and adopt a classification procedure which deals with the sample patterns directly and only implicitly involves the densities. Perhaps the most straightforward approach of this type is the "nearest neighbor decision rule" by which an unknown pattern is assigned to the category containing the training sample closest to it according to some metric defined on the pattern space. This is equivalent to the minimum distance criterion already described, with each sample point defining its own subcategory. The resulting decision surfaces are piecewise linear and will, in general, perform better than the optimal linear boundaries. It has been shown by Cover and Hart that the error rate of this rule is at most twice that of the Bayes optimal classifier for an infinitely large training set. Of course, the problem is that as the number of training samples increases it becomes impractical to compute distances to all of them. One method which is frequently used to overcome this difficulty involves partitioning the samples into subcategories which tend to cluster together. Modal points (typically means) of these subcategories are then used to implement the nearest neighbor rule. See, for example, Firschein and Fischler. This "mode seeking" approach can be very productive but some care must be taken in the selection of clustering algorithms and their parameters for specific problems. In fact, according to Sammon, p. 11, the performance of all known clustering algorithms is so sensitive to the settings of their parameters that "the proper setting usually can only be determined by a trial and error method."
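
The nearest neighbor rule and its modal-point variant can share one routine, since both assign a pattern to the category of the closest labelled point. The sketch below uses invented data in which category A has two separated clusters. The partition of A into two subcategories is simply written down by hand here as an assumption; no clustering algorithm is run, and the function name is hypothetical.

```python
def nearest_neighbor(x, prototypes):
    """Category of the labelled point closest to x (Euclidean metric).

    prototypes is a list of (vector, category) pairs. They may be raw
    training samples or modal points of subcategories.
    """
    def dist2(s):
        return sum((a - b) ** 2 for a, b in zip(x, s))
    _, category = min(prototypes, key=lambda pair: dist2(pair[0]))
    return category


# Category A is bimodal; category B sits between the two modes of A.
training = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"),
            ((9.0, 0.0), "A"), ((10.0, 0.0), "A"),
            ((5.0, 0.0), "B"), ((5.0, 1.0), "B")]

# One mean per category: the mean of A falls in the middle of B.
one_mean_each = [((5.0, 0.0), "A"), ((5.0, 0.5), "B")]

# One mean per subcategory (partition of A assumed known).
modal_points = [((0.5, 0.0), "A"), ((9.5, 0.0), "A"), ((5.0, 0.5), "B")]

x = (6.0, 0.0)
print(nearest_neighbor(x, training))       # full nearest neighbor rule
print(nearest_neighbor(x, one_mean_each))  # minimum distance to category means
print(nearest_neighbor(x, modal_points))   # nearest modal point
```

With three modal points the decision agrees with the full nearest neighbor rule while storing half as many vectors, whereas a single mean per category misplaces the pattern.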
Closely related to this is the attempt by Sebestyen to estimate an arbitrary density function as the mean of a small number of normal densities approximated from subcategories. Besides ad hoc rules for "adaptive sample set construction," this approach involves division of the pattern space into cells and, therefore, runs into difficulty as the dimension increases. The fundamental assumption of both the mode seeking and adaptive sample set construction methods is that the densities in question can be well represented as the sum of symmetric normal densities. Thus they are particularly effective in handling multi-modal densities.

The idea of approximating density functions as means of normal densities is carried to its logical extreme in an elegant technique proposed by Specht, who generates a symmetric density of normal form

$$g_i(X) = \frac{1}{(2\pi)^{n/2}\sigma^n}\,\exp\!\left[-\frac{(X - S_i)'(X - S_i)}{2\sigma^2}\right]$$

about each sample pattern $S_i$. These "interpolation functions" are averaged over all patterns in the training set to obtain the approximation. It is shown, p. 31, that as the number of samples becomes infinite, and as the "smoothing parameter" $\sigma \to 0$, the approximation converges to the true density wherever it is continuous. In order to evaluate the approximation, the exponentials in the interpolation functions are written, using the series expansion, as polynomials in the components of X. The truncated expansions may then be used in d-th degree polynomial discriminant functions to implement a Bayes strategy. This is referred to as the "polynomial discriminant method."

It is interesting to note that as $\sigma \to \infty$, the resulting decision rule becomes the minimum distance classifier, and as $\sigma \to 0$ it becomes the nearest neighbor rule; the corresponding decision surfaces range from strictly linear to highly nonlinear. In practice, the shape of the decision surfaces also depends on the degree d of the truncated polynomial approximation. (See Section IV-2 above.) Generally speaking, the higher the degree, the larger
the  value  of  n  necessary  to  obtain  an  adequately  smooth  approximation. 
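
The two limiting cases of the smoothing parameter can be observed on a small example. The sketch below averages normal-form interpolation functions over the training samples of each category and picks the category with the larger estimate, assuming equal a priori probabilities and equal sample counts. It evaluates the exponentials directly and does not use the truncated polynomial expansion, so it is not the polynomial discriminant method itself. The data and function names are invented, and the two categories were chosen to have equal scatter about their means so that the large-σ decision coincides with the nearest-mean decision.

```python
from math import exp, pi


def parzen_density(x, samples, sigma):
    """Average of symmetric normal densities centred on each sample."""
    n = len(x)
    norm = (2.0 * pi) ** (n / 2.0) * sigma ** n
    total = 0.0
    for s in samples:
        d2 = sum((a - b) ** 2 for a, b in zip(x, s))
        total += exp(-d2 / (2.0 * sigma ** 2))
    return total / (len(samples) * norm)


def classify(x, classes, sigma):
    """Category with the largest density estimate at x (equal priors assumed)."""
    return max(classes, key=lambda name: parzen_density(x, classes[name], sigma))


classes = {
    "A": [(0.0, 0.0), (4.0, 0.0)],   # mean (2, 0)
    "B": [(3.0, 3.0), (3.0, 7.0)],   # mean (3, 5)
}
x = (3.0, 2.0)   # closest single sample is in B, closest mean belongs to A

small = classify(x, classes, sigma=0.3)     # behaves like the nearest neighbor rule
large = classify(x, classes, sigma=100.0)   # behaves like the minimum distance rule
print(small, large)
```

Intermediate values of the smoothing parameter give decision surfaces between these two extremes.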

This  method,  like  all  others,  is  subject  to  "the  curse  of  dimension¬ 
ality."  Specht  shows  that  the  number  of  training  samples  required  to  obtain 
approximations  of  a  given  quality  increases  exponentially  with  the  dimension 
of the pattern space. (It also increases as $\sigma$ is made smaller.) Nevertheless,
the polynomial discriminant method seems to be applicable when only a
small  number  of  training  samples  is  available.  In  fact,  in  an  experiment 
involving  separation  of  normally  distributed  categories,  it  actually  out¬ 
performed  the  optimal  (quadratic)  classifier  (see  Section  IV-3)  based  on  es¬ 
timates  of  the  means  and  covariances.  In  the  experiments  eight  samples  were 
drawn  from  each  category,  but  since  their  dimension  was  only  five,  it  may  not 
be  valid  to  extrapolate  the  results  to  higher  dimensions. 

Finally,  the  polynomial  discriminant  method  shares  the  practical  ad¬ 
vantages  of  matched  filter  methods  over  most  other  techniques.  The  coefficients 
of  the  polynomials  are  simple  averages  of  the  corresponding  coefficients  con¬ 
tributed  by  each  training  sample.  Thus  the  classifier  may  be  made  adaptive 
simply  by  updating  the  discriminant  functions  as  new  samples  are  obtained. 

Also,  unlike  iterative  techniques,  only  one  look  at  each  sample  is  required. 

The  classifier  can  adapt  to  time  varying  statistics  if  exponential  smoothing 
is  used  to  update  the  coefficients.  From  both  the  practical  and  theoretical 
viewpoints,  Specht's  method  is,  in  this  author's  opinion,  the  most  promising 
nonparametric  approach  to  Bayes  optimal  classification. 


Section  V 


INFORMATION  COMPRESSION  APPLIED  TO 
DATA  CLASSIFICATION  PROBLEMS 

In  the  previous  Section  we  considered  the  pattern  classification 
problem  in  isolation.  The  design  of  a  system  for  pattern  recognition  gen¬ 
erally  includes  two  other  stages:  feature  extraction,  the  problem  of  what 
measurements  to  use,  and  optimization  of  system  parameters.  Since  the  op¬ 
timal  parameters  are  dependent  on  the  statistical  properties  of  the  data, 
they  are  usually  estimated  empirically;  this  problem  will  not  be  discussed 
further.  This  Section  considers  the  feature  extraction  problem  and  the 
applicability  of  principal  components  analysis  and  discriminant  analysis. 
Experimental  results  are  described  in  which  both  methods  were  used  as  feature 
extractors  for  a  minimum  distance  classifier. 

1. The Feature Extraction Problem

Feature extraction is the process of selecting a relatively small number of measurements or combinations of measurements which tend to describe the characteristic features of the pattern classes. There are two basic goals: (a) minimizing the number of features and the resulting dimension of the classifier, and (b) finding features which determine a space in which the members of each pattern class will tend to cluster together, thus improving the performance of the classifier or making it possible to use a simpler algorithm. In some instances physical considerations will indicate an appropriate choice of measurements and feature extraction is primarily an engineering problem. Our consideration here is restricted to the situation in which a well defined set of sensor measurements already exists and the problem is to select features from these measurements. In this context feature extraction may be thought of as a mapping from the measurement space into a "feature space" which accomplishes either or both of the above goals. In Section IV-2 we saw that the component functions of a $\Phi$ function family determine a mapping from the measurement space into the $\Phi$ space, in which the decision regions are linear. The choice of good component functions may thus be regarded as feature selection if it improves the performance of a linear classifier in the $\Phi$ space. Consider, for example, the polynomial discriminant function of Section IV-4. The terms of the polynomials are the component functions. Goal (b) of feature extraction is satisfied by these polynomial terms, because as the degree of the polynomials increases the Bayes optimal classifier is more nearly approximated. But goal (a) is not, because the number of terms increases rapidly with the degree of the polynomials so that the number of measurements is actually increased. Specht, however, proposes methods for eliminating terms which are least useful in classification.

The feature extraction problem does not lend itself to a general solution. This is partly because the goodness of the features can ultimately be judged only on the performance of the recognition system, which depends also on the classification algorithm used. Another difficulty is the unbounded number of feature space transformations which are possible. If we consider only selection of measurements, for example, there are $\binom{n}{p}$ possible subsets of p measurements chosen from a set of n. Some workers report success by simply choosing random subsets of redundant measurement sets. Another approach is to define some measure of the information content of each measurement relative to the classification of training categories. Measurements are then selected which have the largest information content. Since the above techniques treat measurements separately, they ignore the joint densities of the measurements. A nonparametric method for evaluating measurement subsets which does consider the joint densities is proposed by Fu. This method employs direct estimation from multivariate density estimates of the error probability of a particular measurement subset; however, it offers no guidance for the choice of prospective subsets. For more detailed discussions of feature extraction and references, consult Fu or Nagy, pp. 852-854. In the remainder of this Section, we consider the application of intrinsic analysis and discriminant analysis to the feature extraction problem.

2.  Application  of  Principal  Components  Analysis  and  Discriminant  Analysis 

The linear dimension reduction and data discrimination techniques reviewed
in Sections II and III find useful application in feature extraction. They
produce linear transformations which may be applied to the measurement
space to obtain reduced dimension and improved performance of pattern
classifiers, or at least of linear classifiers, which we shall consider
here. Highleyman, p. 1505, shows that any pattern classifier using linear
discriminant functions is invariant under nonsingular linear
transformations of the measurement space. But an appropriate singular
(dimension reducing) transformation can improve performance.

Principal components analysis applied to the pooled sample sets (Section
III-2) yields a linear transformation into the subspace of the measurement
space spanned by the principal components. This transformation acts as a
suboptimal feature selector by reducing the linear redundancy of the
measurements. We have seen that the coefficients in the subspace are
mutually uncorrelated over the ensemble of all the categories. This fact
may tend to simplify the densities in the principal components basis. Also,
by eliminating low variance components, the transformation could actually
eliminate random noise present in the measurements. But its primary
applicability is to goal (a) of feature extraction, dimension reduction. In
an application to crop classification (Fu) this approach was compared with
the method of minimizing estimated error probabilities. The results were
about equal if more than three features were allowed.

The  above  method  does  not  consider  class  membership  information  and 
so  could  discard  components  related  to  it.  The  natural  remedy  to  this 
danger  is  discriminant  analysis,  which  maximizes  the  variance  of  class  means 
relative  to  within  class  variance.  The  dimension  reduction  is  extreme,  since 
the  number  of  discriminants  is  one  less  than  the  number  of  categories.  Thus 
if there is a small number of categories, the representation in the
discriminant space may not be adequate to represent complicated densities. We shall
see,  on  the  other  hand,  that  it  can  be  very  effective  for  problems  with  any 
degree  of  linear  separability. 
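
The contrast drawn above between principal components and discriminant analysis can be illustrated with a small sketch. It assumes only two categories in two dimensions, so that there is a single discriminant (one less than the number of categories), and it uses the textbook two-class form w = Sw^{-1}(m_a - m_b) rather than the eigensystem routine of this report. The data and all function names are hypothetical.

```python
# Two-category discriminant in two dimensions (illustrative data only).

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def within_scatter(classes):
    """Sum over classes of (x - class mean)(x - class mean)' as a 2x2 list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for vectors in classes:
        m = mean(vectors)
        for v in vectors:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def discriminant(class_a, class_b):
    """Single discriminant for two categories: w = Sw^{-1} (mean_a - mean_b)."""
    sw = within_scatter([class_a, class_b])
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    ma, mb = mean(class_a), mean(class_b)
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]


# Most of the variance lies along the first coordinate, but the categories
# differ only in the second, low-variance coordinate.
class_a = [[0.0, 0.0], [1.0, 0.2], [2.0, -0.1], [3.0, 0.1]]
class_b = [[0.0, 1.0], [1.0, 1.2], [2.0, 0.9], [3.0, 1.1]]

w = discriminant(class_a, class_b)
proj_a = [w[0] * x + w[1] * y for x, y in class_a]
proj_b = [w[0] * x + w[1] * y for x, y in class_b]
```

In this made-up example the discriminant points along the low-variance coordinate, which a variance-ranked truncation would be the first to discard, and the two projected categories do not overlap.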

3.  Experimental  Results 

Principal components analysis and discriminant analysis were applied to a
classification problem involving aircraft radar frequency signatures. Each
sample pattern comprised measurements of 320 frequency components. Eight
distinct categories were represented by a total of 281 samples. The samples
of each category, an average of 35, were divided as evenly as possible into
a training set and a testing set. Mean vectors of the categories and the
grand mean were estimated from the training sets. Principal components of
the pooled training sets were estimated by the approximation method
described in Section II-3. To reduce the dimension sufficiently to compute
estimated discriminant vectors for the training sets, the samples were
represented in terms of the first 70 principal components. The
representation retained 97% of the variance of the training data, while
achieving a dimension compression of more than four to one. Finally, the
seven discriminant vectors were orthogonalized to form a basis for the
discriminant space.
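
The truncation step above (keep the leading components until a target share of the variance is retained) can be sketched as follows. This is a minimal illustration, assuming the eigenvalues of the sample covariance matrix are already available. The function name and the eigenvalue spectrum are hypothetical and are not the values obtained in this experiment.

```python
def components_needed(eigenvalues, fraction):
    """Smallest k such that the k largest eigenvalues hold `fraction` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return k
    return len(ordered)


# Hypothetical eigenvalue spectrum, not taken from the report's data.
eigenvalues = [50.0, 25.0, 12.0, 6.0, 4.0, 2.0, 1.0]
k = components_needed(eigenvalues, 0.95)
```

The retained fraction is measured on the training data only. As noted later in this section, the variance actually lost on test data may be larger because the components themselves are estimates.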

The approximate principal components and the orthonormalized discriminants
both span subspaces of the pattern space with origin at u, the grand mean.
Representations of the pattern vectors and the category mean vector
estimates in these subspaces were obtained by the change of basis
transformations

$$X_Y = Y'(X - u)$$

$$X_D = D'X_Y = D'Y'(X - u) = (YD)'(X - u)$$

where X is a vector in the pattern space and the columns of Y and D are
the principal components and the discriminant vectors.
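
The two change of basis transformations above can be checked numerically. The sketch below assumes that the discriminant vectors in D are expressed in the principal components basis, so that the two-step and one-step forms agree. The vectors are small made-up examples, and the helper names are hypothetical.

```python
def transpose_times(basis, v):
    """Return basis' v, where `basis` is given as a list of column vectors."""
    return [sum(b * x for b, x in zip(column, v)) for column in basis]


def compose(Y, D):
    """Columns of the product YD: each column of D mixes the columns of Y."""
    n = len(Y[0])
    return [[sum(d[k] * Y[k][i] for k in range(len(Y))) for i in range(n)]
            for d in D]


u = [1.0, 2.0, 3.0]                       # grand mean (illustrative)
Y = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]    # two orthonormal columns in 3-space
D = [[0.6, 0.8]]                          # one discriminant, in the Y basis
X = [2.0, 4.0, 1.0]                       # a pattern vector

centered = [x - m for x, m in zip(X, u)]
X_Y = transpose_times(Y, centered)                    # Y'(X - u)
two_step = transpose_times(D, X_Y)                    # D'X_Y
one_step = transpose_times(compose(Y, D), centered)   # (YD)'(X - u)
```

Forming the product YD once and applying it to every centered pattern avoids storing the intermediate principal components representation.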

The minimum distance classification algorithm described in Section IV-3 was
applied to the test patterns directly and in these two representations.
The error rate in the original basis was 26.3%.

At best, the principal components representation improved this performance
only negligibly, to 25.5%. On the other hand, it did at least as well even
after components accounting for 15% of the variance of the training data
(all but 17) had been discarded, a dimension reduction of twenty to one.
(Due to statistical errors in the estimation of the principal components,
it is likely that more than 15% of the variance of the test data was
ignored in this representation.) As even more of the principal components
were discarded, the error rate increased gradually to 30% for five vectors
(43% of variance ignored) and rose sharply thereafter. There were only two
instances in which the error rate actually decreased with increased loss of
variance, at dimensions 22 and 7; in both cases, the decrease was slight.
These results are reasonable since the principal components representation
preserves optimally the (squared) lengths of the patterns and the
classifier compares distances to category means, which are lengths of
difference vectors. Since no significant improvement in classification was
achieved, the primary value of principal components analysis was the
reduction of the dimension so that discriminant analysis could be applied.

As expected, the performance of the minimum distance classifier improved
substantially in the discriminant basis, with an error rate of 7.3%. This
error is attributable to statistical error in estimation of the
discriminants and mean vectors, because the test patterns are linearly
separable. This was demonstrated by computing optimal discriminants
directly from the test samples; that representation reduced the error rate
to zero. For purposes of comparison, the mean vectors and principal
components were also computed directly from the test data. Classification
of the untransformed test data with the exact means resulted in an error
rate of 19.7%; the difference between this and the "honest" error rate of
26.3% can be attributed to errors in estimation of the means from the
training data. The percent error rates are summarized in the box below. It
should be emphasized that the results in the second row cannot be achieved
in practice and are included only to point out the estimation problem.

Section VI

IMPLEMENTATION 

Most of the above methods for data compression and classification have been
implemented on the Dynamic Experimental Processor (DX-1) at the Multisensor
Signal Processing Branch, Air Force Cambridge Research Laboratories. The
hardware configuration includes two Digital Equipment Corporation PDP-1
central processors, an IBM 2311 disk storage unit, several CRT display
consoles including a DEC color display, and a core-buffered,
line-generating display unit called the Experimental Display Processor
(XDP), which drives two of the consoles. In order to create a suitable
environment for the development and operation of the computer programs
involved, it has been necessary to design an operating system which
provides the appropriate interactive data management and program execution
capabilities.

A fundamental requirement is the ability to symbolically identify files of
vectors on-line for random access storage and retrieval. This is
accomplished by a disk based, fixed record length file system. Variable
length information, including programs and relocatable subroutines, is
stored in partitioned files.

Programs are named, stored, loaded and executed on-line by the system
monitor. Data files may also be manipulated through the monitor. Each user
or problem is assigned a code which assures unique identification of his
partitioned files and data files. Each user has read/write access to his
own files and read-only access to all others. Thus all programs and data in
the system may be shared by all users.

A convenient means of visually evaluating the results of data
representation algorithms is provided by an interactive vector display
program for the color CRT. Vectors may be displayed as graphs of their
components or as projected points on a hyperplane determined by an
arbitrary pair of vectors, or both, under user control. Commands are also
supplied for scaling the projected images, saving them in random access
storage, and rotating the projection plane, in real time, to a plane
determined by two new axes. The program is particularly flexible because of
the on-line access to all data vectors by file name, which is provided by
the vector file system.

To  facilitate  the  programming  of  the  data  analysis  algorithms,  a  special 
purpose  language,  called  AMAC  (Assembler  Macros  for  Algebraic  Computations), 
has  been  designed.  Important  features  of  AMAC  include  a  run-time  storage 
allocation capability, vector and matrix manipulation instructions and a
comprehensive set of input/output macros.

Numerical algorithms for the system have been programmed to be as modular
as possible, to allow flexibility in choosing the processes to be carried
out. Therefore many of the techniques described in earlier sections require
execution of several of the program modules described in this section.

To  simplify  the  execution  process,  it  would  be  useful  for  the  system  to 
"remember"  sequences  of  program  calls  which  could  then  be  activated  by  a 
single  monitor  command.  Steps  toward  this  goal  are  discussed  at  the  end  of 
this  section. 

1.  Data Management

Random access storage for the DX-1 system is an IBM 2311 magnetic disk
storage unit. The portion of the operating system which controls storage
and retrieval of information on the disk is called the Disk File System. It
stores information in fixed record length files, which is a particularly
convenient form for vector data. An ID table is maintained which contains a
unique six character name for each file and its size and location on the
disk. (The first character of each file name is used by the system to
designate the user or problem to which the file belongs, leaving five
characters to be supplied by the user.) When a file is created, its name
must be specified, along with its record length (number of data elements in
each record) and element length (number of 18-bit words in each element).
These parameters are fixed and are stored in the file system ID table along
with the current file length and the physical location of records on the
disk. The file length is never specified explicitly and may be increased at
any time simply by writing more records. Records may be rewritten at any
time.

The  basic  Disk  File  System  commands  are  available  to  the  user  through 
the  on-line  monitor  and  to  programs  as  standard  I/O  macros.  They  are  the 
following: 


    assign  -  specify file parameters and enter file name in ID table
    rename  -  change name of a file already assigned
    delete  -  erase file and its ID table entry
    lookup  -  retrieve file parameters (including file length) from ID table
    write   -  write one or more contiguous records
    read    -  read one or more contiguous records

For a thorough discussion of the implementation, characteristics and
maintenance of the Disk File System, see Ref. 30.
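
The behavior of the six basic commands can be mimicked with a short in-memory sketch. It is an illustration only: the class name, the method signatures, the use of Python lists for records, and the rule that a write may not leave a gap past the end of the file are all assumptions, and nothing here reflects the actual disk layout described in Ref. 30.

```python
class DiskFileSystem:
    """In-memory stand-in for a fixed record length file store (a sketch)."""

    def __init__(self):
        self.id_table = {}  # file name -> fixed parameters and records

    def assign(self, name, eltlen, reclen):
        """Enter a new file name and its fixed parameters in the ID table."""
        if len(name) > 6 or name in self.id_table:
            raise ValueError("bad or duplicate file name: " + name)
        self.id_table[name] = {"eltlen": eltlen, "reclen": reclen, "records": []}

    def rename(self, old, new):
        if new in self.id_table:
            raise ValueError("file name already in use: " + new)
        self.id_table[new] = self.id_table.pop(old)

    def delete(self, name):
        del self.id_table[name]

    def lookup(self, name):
        """Return (element length, record length, current file length)."""
        entry = self.id_table[name]
        return entry["eltlen"], entry["reclen"], len(entry["records"])

    def write(self, name, index, records):
        """Write contiguous records at `index`; the file grows as needed."""
        entry = self.id_table[name]
        if index > len(entry["records"]):
            raise IndexError("write would leave a gap in the file")
        for record in records:
            if len(record) != entry["reclen"]:
                raise ValueError("record length is fixed when the file is assigned")
        entry["records"][index:index + len(records)] = [list(r) for r in records]

    def read(self, name, index, nrecs):
        records = self.id_table[name]["records"]
        return [list(r) for r in records[index:index + nrecs]]


fs = DiskFileSystem()
fs.assign("avecs", eltlen=2, reclen=3)
fs.write("avecs", 0, [[1, 2, 3], [4, 5, 6]])
fs.write("avecs", 1, [[7, 8, 9], [0, 0, 0]])   # rewrites record 1, appends record 2
```

Note that the file length is never passed to assign. It is implied by the records written so far, which matches the description above.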

The operating system also includes routines which handle partitioned files
for variable length information. Each partitioned file occupies two
ordinary files, a table of contents file, and a file for the actual
information. Members of partitioned files are given six character
alphanumeric names which are unique within the file. The file names are
determined by the user identification and a type code. Partitioned file
types used and anticipated include programs, relocatable subroutines,
source programs, documentation and procedure definitions. Thus all programs
belonging to a given user, for example, are stored in one partitioned file.
By combining several separate elements of information into a single file,
partitioned files increase disk space utilization and reduce average access
time.

2.  The On-Line Monitor

The monitor controls user identification, program storage, loading and
execution, as well as on-line operations on data files. When using the
system, each individual ordinarily supplies his identification character,
which is added to his data file names and partitioned file names. One
character, 1, is reserved to designate library files. This identification
code must be used in order to update library files, and is assumed until
the user provides his own code. The user is allowed to assign, rename,
delete and write only files whose names are prefixed with his code. He may
read or lookup files prefixed with either his own or the library code, or
other files by supplying an overriding prefix code.

Monitor commands are issued on the console typewriter or Soroban display
keyboard, using only lower case characters. The format is a command symbol
followed possibly by argument symbols separated by break characters. Legal
symbols may contain alphanumeric characters, period or minus. All other
characters, including comma, slash, space, tab, and carriage return, are
break characters which terminate symbols. Two special break characters are
used to erase previous input. Backspaces erase previous characters and
middle dot erases the entire command. Monitor commands and their
descriptions are listed in the table of DX-1 monitor command formats.
Optional information is enclosed in brackets. The single character prefix,
separated by a slash, overrides the current user identification, and the
three character prefix designates a partitioned file type. Each of the
commands may be abbreviated to three characters.
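
The command format rules above can be sketched as a small parser. This is a simplified illustration, not the monitor's actual input routine: the function names are hypothetical, the backspace and middle dot are represented by the characters "\b" and "\u00b7", and the sketch ignores empty arguments and the parenthesized mask syntax.

```python
import re

# Full command names from the monitor command table.
COMMANDS = ["user", "store", "load", "start", "call", "list", "newname",
            "remove", "assign", "write", "read", "rename", "delete", "lookup"]

SYMBOL = re.compile(r"[a-z0-9.\-]+")    # legal symbol characters
BACKSPACE, MIDDLE_DOT = "\b", "\u00b7"  # assumed encodings of the erase keys


def apply_erasures(keystrokes):
    """Apply the two erasing break characters to a raw keystroke string."""
    buffer = []
    for ch in keystrokes:
        if ch == BACKSPACE:
            if buffer:
                buffer.pop()      # erase the previous character
        elif ch == MIDDLE_DOT:
            buffer.clear()        # erase the entire command
        else:
            buffer.append(ch)
    return "".join(buffer)


def expand(symbol):
    """Resolve a command symbol that may be abbreviated to three characters."""
    matches = [c for c in COMMANDS if c.startswith(symbol)]
    if len(symbol) < 3 or len(matches) != 1:
        raise ValueError("unknown command: " + symbol)
    return matches[0]


def parse(keystrokes):
    """Return (command, argument symbols) for one line of monitor input."""
    symbols = SYMBOL.findall(apply_erasures(keystrokes))
    if not symbols:
        return None, []
    return expand(symbols[0]), symbols[1:]
```

For example, `parse("rea a/fname,4000,0,2")` yields the command `read` with the symbols `a`, `fname`, `4000`, `0` and `2`, since both slash and comma act as break characters.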

The  mask  feature  should  be  explained  further.  The  parentheses  may  enclose 
up  to  five  characters,  any  of  which  may  be  blank.  All  files  are  selected 
whose  names  correspond  in  the  non-blank  characters.  If  no  characters  are 
contained  within  the  parentheses,  all  files  are  selected. 
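
The mask rule can be stated in a few lines of code. The sketch below assumes that the mask is compared position by position against the five user-supplied characters of each file name, with a blank in the mask matching any character. The function names and file names are hypothetical.

```python
def matches(mask, name):
    """True if `name` agrees with `mask` in every non-blank mask position."""
    padded = name.ljust(5)
    return all(m == " " or m == c for m, c in zip(mask, padded))


def select(mask, names):
    """Return the names selected by a mask of up to five characters."""
    return [name for name in names if matches(mask, name)]


names = ["vecs1", "vecs2", "means", "eig"]
by_prefix = select("vecs", names)   # first four characters fixed
by_second = select(" e", names)     # only the second character fixed
everything = select("", names)      # empty mask selects all files
```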

DX-1 MONITOR COMMAND FORMATS

Program Commands

user                            Examine current user identification
user u                          Supply user identification letter
store pgname,ia1,fa1,...,sa     Store program on disk - ia is initial
                                address and fa is final address of each
                                block, sa is start address
load [u/]pgname[,m]             Load program into module m
start                           Start loaded program
call [u/]pgname                 Load and start program
list [u/][pgname]               Type names of programs of current or
                                indicated user; type program addresses if
                                pgname is specified
newname oldnam,newnam           Change program name
remove pgname                   Delete program

Data File Commands

assign fname,eltlen,reclen      Assign file parameters - fname is file
                                name, eltlen is element length, reclen
                                is record length
write fname,ia,index,nrecs      Write into file - ia is octal location
                                of first record, index is position of
                                record in file, nrecs is number of records
                                transferred
read [u/]fname,ia,index,nrecs   Read from file of current or indicated
                                user
rename [pft/]oldnam,newnam      Change file or partition name
delete [pft/]fname              Delete file or partition
delete (mask)                   Delete matching files
lookup [u/][pft/][fname]        Type file or partition names of current
                                or indicated user; type file parameters
                                if fname is specified
lookup [u/](mask)               Type matching file names of current or
                                indicated user

3.  Dynamic Color Display of Vector Data

Graphic manipulations of the vector data files are carried out by a
separate display program which is accessed through the monitor. Its primary
function is vector projection of one or more files of vectors on a plane
determined by any pair of non-collinear vectors. The program assigns a
different color to each file to allow easy identification of the projected
points. Such a projection may then be scaled up or down or transformed
continuously into a projection on a new pair of coordinate axes. These
functions, together with the data compression and discrimination algorithms
already described, aid the user in discerning statistical relationships
among several sets of data vectors. Properties of particular vectors may be
presented by selecting one of the projected points with the light pen. This
causes a graph of the coordinates of the projected vector to be displayed
along with the projection. Graphs of the corresponding vector in different
basis representations (and therefore different disk files) may also be
requested. The sequence of display manipulations is determined by issuing
commands on the display console keyboard. The effects of these commands are
described in detail below.

Newdata:  The program initializes the display, requests a list of names of
files of vectors to be displayed, checks its validity, and divides all the
data by the norm of the longest vector to assure that all projections will
fit on the screen.

Project:  The program asks for horizontal and vertical axes, which are
specified by vector file name and logical record index within each file.
These vectors are normalized to unit length so that only their directions
are used. In general the axes are not orthogonal and covariant projection
is used, whereby a projected point is displayed at the intersection of the
normals to the axes. Projection axes typically used include principal
components, discriminant vectors, category mean vectors and standard basis
vectors (for coordinate projection).
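
One way to read the covariant projection rule is sketched below. The report does not give the display formulas, so the screen layout is an assumption: the first axis is drawn along the horizontal screen direction, the second axis is drawn at the true angle between the two axes, and the point is placed where the normals erected on the axes at the two inner products meet. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def covariant_projection(x, axis_h, axis_v):
    """Screen position of x for two non-collinear, possibly oblique axes."""
    h, v = unit(axis_h), unit(axis_v)       # only the directions are used
    c_h, c_v = dot(x, h), dot(x, v)         # covariant components of x
    cos_t = dot(h, v)
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    # First axis drawn along (1, 0), second axis along (cos_t, sin_t).
    # The point lies where the normals erected at c_h and c_v intersect.
    return c_h, (c_v - c_h * cos_t) / sin_t


x = [3.0, 4.0, 5.0]
orthogonal = covariant_projection(x, [1.0, 0.0, 0.0], [0.0, 2.0, 0.0])
oblique = covariant_projection(x, [1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
```

Under this reading both calls place the point at the same screen position, because the two pairs of axes span the same plane and the construction amounts to an orthogonal projection onto that plane.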

Scale:  Due to the normalization of the data, the projections frequently do
not fill the displayable area of the color CRT. The scale command allows
the picture to be expanded (or contracted) in increments specified by the
user.

Rotate:  This command transforms the current projection into a projection
of the same vectors onto a plane determined by a new pair of axes.
Selection and normalization of the axes are carried out as in "project."

The program in effect generates a whole sequence of projection planes,
whose axes are located at equal angular intervals between the original axes
and the new axes. The projections of the data vectors are computed directly
only on the new axes. The projections on all of the intermediate axes are
computed, in real time, using formulae involving sines and cosines of sums
of angles. These calculations are so efficient that a great number of
intermediate projections may be generated in a short time, even for
hundreds of data vectors. The effect produced is an apparently continuous
rotation of the plane of projection. The user controls the speed (and
smoothness) of this rotation by his choice of the number of intermediate
projections. The rotation may be interrupted midway or reversed by sense
switch control.
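
The report does not reproduce the rotation formulae, so the following is only one plausible reconstruction. It assumes that each intermediate axis lies in the plane of the old and new unit axes, at angle phi from the old one, so that a(phi) = [sin(theta - phi) a_old + sin(phi) a_new] / sin(theta), where theta is the angle between the two axes. Because projection is linear, only the two end projections of each data vector are needed, and the sine and cosine of the running angle are advanced with the angle-sum identities. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def intermediate_projections(p_old, p_new, theta, steps):
    """Projections of one vector on axes spaced evenly from old to new axis.

    p_old, p_new: projections on the old and new unit axes
    theta: angle between the two axes (0 < theta < pi)
    """
    delta = theta / steps
    sin_d, cos_d = math.sin(delta), math.cos(delta)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    s, c = 0.0, 1.0                 # sin and cos of the current angle phi = 0
    frames = []
    for _ in range(steps + 1):
        w_old = (sin_t * c - cos_t * s) / sin_t   # sin(theta - phi) / sin(theta)
        w_new = s / sin_t                         # sin(phi) / sin(theta)
        frames.append(w_old * p_old + w_new * p_new)
        # Advance phi by delta using the angle-sum identities.
        s, c = s * cos_d + c * sin_d, c * cos_d - s * sin_d
    return frames


a_old, a_new = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
x = [3.0, 4.0, 5.0]
theta = math.acos(dot(a_old, a_new))
frames = intermediate_projections(dot(x, a_old), dot(x, a_new), theta, 2)
```

Each frame costs two multiplications and one addition per data vector and per axis, with no inner products over the full dimension, which is consistent with the efficiency claimed above.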

This feature has several applications. It facilitates the comparison of
projection planes by allowing the user to follow the movement of individual
points between them. It makes possible visual evaluation of the "stability"
of a projection with respect to perturbations of its axes. Finally, the
user may "explore" the vector space to discover projection planes (on
intermediate axes) which may appear more desirable than those which are
directly available.

Graph:  Two modes are available for selection of vectors to be displayed in
their component representations. Any vector already projected may be
pointed out on the display screen with the light pen. Alternatively, any
vector stored in a disk file may be indicated by file name and record
index. Up to five graphs are displayed beside the current projection. Their
colors are selected by the user to aid in distinguishing the graphs.

4.  Program Development

The DX-1 is an experimental system which undergoes frequent hardware
modification. Programs are therefore coded in the DX-1 MIDAS assembly
language, which can be easily modified to accommodate new instructions and
I/O operations. Coding of complicated mathematical algorithms, however, is
difficult and tedious at the level of machine instructions. Therefore, a
set of MIDAS macro instructions, called AMAC, has been written which
provides some of the power of an algorithmic language without sacrificing
the flexibility of assembly language.

AMAC includes a limited form of arithmetic statement with subscripted
variables which may also contain bit manipulation and logical operations.
Among other FORTRAN-like features are instructions for looping, conditional
execution and subroutine linkage and a library of arithmetic function
subroutines. Unlike FORTRAN, AMAC allows run-time storage allocation.
Especially useful in this effort have been macros which call subroutines
performing matrix/vector operations typically involved in statistical
applications, such as inner products, sums, differences, and matrix
products.

AMAC contains an integrated set of character-oriented I/O macros for the
on-line typewriter, display console keyboard, paper tape reader and punch,
and CRT display. Specific devices and formats are specified as arguments of
the macros; thus the effective device may be a run-time variable. Disk I/O
is performed by the macros described in Section VI-1. A thorough
description of AMAC may be found in Ref. 31.

The  program  modules  used  in  the  experiments  of  Section  V  are  executed 
by  the  monitor  commands  described  below.  Arguments  are  currently  requested 
individually  by  the  programs,  but  for  clarity  they  are  indicated  here  as 
lists.  Parentheses  contain  arguments  which  may  be  lists;  brackets  indicate 
arguments  which  may  be  omitted. 

call intan((files),dim,rdim,[gmean],[switch],eigsys)

Performs intrinsic analysis on the (pooled) vectors contained in files.
The first dim elements of each vector are used; rdim eigenvalues and
eigenvectors are computed. The values are typed out on-line and the
vectors are written into the file to be named eigsys. The grand mean
is written in file gmean if the name is supplied. Intrinsic analysis
is performed if switch is nonzero; principal components analysis (data
centered about the mean) if it is null.




call chbas((ifiles),dim,[vector],[basis],(ofiles))

Performs the translation and change of basis transformation

    output = basis' (input - vector)

on the first dim components of the vectors in ifiles. The dimension of
the output vectors written in ofiles is the number of vectors in basis
(its file length). If vector is null, only the matrix multiplication is
performed; if basis is null, only the vector difference. This program
is typically used to transform data into an intrinsic or principal
components basis.

call intanp((ifiles),dim,p,rdim,[gmean],switch,prteig)
call chbasp((ifiles),dim,p,[gmean],prteig,(ofiles))

These programs are used together to perform the preliminary reduction
for the approximate intrinsic analysis described in Section II-3. The
vectors in ifiles are partitioned into p equal segments to produce
segment eigensystems which are stored in prteig. The programs and
their arguments ifiles, dim, rdim, gmean, switch and ofiles are analogous
to those described above. The output in ofiles is processed by intan
and chbas to complete the approximation process.

call means((files),means,[gmean])

Computes mean vectors for all input files and writes them in means; writes
the grand mean in gmean if specified.

call discrm((files),dim,[gmean],means,dvecs)

Performs linear discriminant analysis on the first dim components of the
vectors in files. Each file represents one pattern category. Discriminant
vectors are written in file dvecs; eigenvalues are typed on-line.
Category means are written in means; the grand mean is written in gmean
if indicated.

call orthog(ifile,ofile)

The vectors in ifile are orthogonalized and written in ofile.

call mindis((files),dim,means,[basis])

Applies the minimum distance classification rule to the first dim
components of the vectors in files. The vectors in means are used as
prototype patterns. If basis is not null, both data and means are first
transformed into it. If the number of files matches the number of means,
error percentages are printed in addition to a confusion matrix.
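
The classification step performed by the last module can be sketched in a few lines. This is an illustration of the minimum distance rule with a confusion matrix, not the module's actual code: the function names and data are hypothetical, the optional change of basis is omitted, and Euclidean distance to each prototype is assumed.

```python
def classify(x, means):
    """Index of the prototype (category mean) nearest to x."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
    return distances.index(min(distances))


def confusion_matrix(files, means):
    """Row i counts how the vectors of file i were assigned to categories."""
    matrix = [[0] * len(means) for _ in files]
    for true_category, vectors in enumerate(files):
        for x in vectors:
            matrix[true_category][classify(x, means)] += 1
    return matrix


def error_percent(matrix):
    """Percentage of vectors assigned to a category other than their own."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - correct) / total


# One list of test vectors per category, and one mean per category.
means = [[0.0, 0.0], [4.0, 0.0]]
files = [[[0.0, 1.0], [1.0, 0.0], [3.0, 0.0]],
         [[4.0, 1.0], [5.0, 0.0]]]
matrix = confusion_matrix(files, means)
errors = error_percent(matrix)
```

The error percentage is meaningful only when file i corresponds to mean i, which is why the module prints it only when the two counts agree.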

5.  Extensions of the System

One of the goals of this effort has been to provide a facility for
experimentally discovering or verifying sequences of analytical methods
useful in data reduction and classification problems. Due to the modular
nature of the algorithms developed for the system, the typical procedure
may require calling several separate programs. The user must remember the
sequence of programs to be performed along with the argument lists of each,
which, as the above examples indicate, are often highly redundant. To ease
the burden, a "cataloged procedure" facility has been designed. It will add
three new commands to the monitor: define, termin and exec. The procedure
name is supplied to define along with a list of dummy arguments. Any
sequence of legitimate monitor commands using constant or dummy arguments
follows. The definition is ended by the termin command. A procedure thus
defined may then be invoked by the exec command with the procedure name and
a list of actual arguments.

For example

    define intrep(ilist/old,dim/dec,rdim/dec,eigsys/new,olist/new)
    call intan(ilist,dim,rdim,,,eigsys)
    call chbas(ilist,dim,,eigsys,olist)
    delete eigsys
    termin

will make possible the command

    exec intrep((file1,file2),120,30,psi,(file1i,file2i))

which produces the intrinsic basis representation of the vectors in file1
and file2 and then deletes the eigensystem from the disk.
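
The substitution performed by a cataloged procedure can be sketched as a simple expansion of dummy arguments. Since the facility is described here only as a design, everything in the sketch is an assumption: the in-memory catalog, the function names, and the representation of each command as a name plus a list of argument symbols. Type codes are left out of this sketch.

```python
catalog = {}


def define(name, dummies, body):
    """Record a procedure: dummy argument names plus a list of commands.

    Each command is a pair (command text, list of argument symbols).
    """
    catalog[name] = (list(dummies), list(body))


def execute(name, actuals):
    """Expand a cataloged procedure into concrete monitor commands."""
    dummies, body = catalog[name]
    if len(actuals) != len(dummies):
        raise ValueError("wrong number of arguments for " + name)
    binding = dict(zip(dummies, actuals))
    # Dummy names are replaced by actual arguments; constants pass through.
    return [(command, [binding.get(arg, arg) for arg in args])
            for command, args in body]


define("intrep", ["ilist", "dim", "rdim", "eigsys", "olist"], [
    ("call intan", ["ilist", "dim", "rdim", "", "", "eigsys"]),
    ("call chbas", ["ilist", "dim", "", "eigsys", "olist"]),
    ("delete", ["eigsys"]),
])
commands = execute("intrep", [("file1", "file2"), "120", "30", "psi",
                              ("file1i", "file2i")])
```

The expansion yields three concrete commands, the last of which deletes the file psi.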

A problem which arises immediately is argument screening. If arguments are
to be supplied all at once in lists, there is a strong possibility of
out-of-order arguments which would cause execution errors. Therefore, an
argument processing routine will be added to the operating system which
will screen arguments requested by programs for proper type and format. So
that procedure arguments may be screened before the programs are called,
their types are indicated in the definition by characters attached to the
dummy names. The argument types currently recognized are the following:


    Code    Format of Argument

    old     previously assigned file id
    new     id of file assigned by this procedure
    oct     octal integer
    dec     decimal integer
    flo     floating point number
    tex     arbitrary text enclosed in brackets
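
A screening routine for these type codes might look like the sketch below. It is a guess at reasonable checks, not the planned routine: the function name is hypothetical, the code for octal integers is read as "oct", a "new" id is assumed to be a legal symbol of at most five characters that is not yet assigned, and floating point syntax is delegated to Python's own parser.

```python
import re


def screen(value, code, assigned_files):
    """Return True if `value` is acceptable for the given argument type code."""
    if code == "old":
        return value in assigned_files
    if code == "new":
        legal = re.fullmatch(r"[a-z0-9.\-]{1,5}", value) is not None
        return legal and value not in assigned_files
    if code == "oct":
        return re.fullmatch(r"[0-7]+", value) is not None
    if code == "dec":
        return re.fullmatch(r"-?[0-9]+", value) is not None
    if code == "flo":
        try:
            float(value)
        except ValueError:
            return False
        return True
    if code == "tex":
        return len(value) >= 2 and value[0] == "[" and value[-1] == "]"
    raise ValueError("unknown argument type code: " + code)


existing = {"file1", "file2"}
checks = [screen("file1", "old", existing),
          screen("psi", "new", existing),
          screen("120", "dec", existing),
          screen("778", "oct", existing)]
```

Screening before any program is called means that an out-of-order argument list is rejected as a whole, rather than failing partway through a sequence of programs.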

Finally, as the library of programs and procedures grows, it will be
increasingly desirable to provide on-line graphic system documentation. This
can be supplied by monitor commands to list the names of available procedures
and display short writeups and argument descriptions for specific programs
and procedures, which could be stored conveniently in partitioned files.

SUMMARY AND CONCLUSIONS

This research has extended the theoretical foundations and developed
the practical techniques necessary for implementation of an interactive
multivariate data analysis facility. The analytical problems treated can
be broken down roughly into three areas: efficient representation of
high-dimensional information, representation of multicategory information
for graphic display, and determination of optimal or satisfactory decision
procedures for data classification. Classical multivariate statistical
methods have proven valuable in these applications. Principal components
analysis (or intrinsic analysis) effects compression of vector data with
minimum mean square error. Linear discriminant analysis produces axes which
maximize the variance among categories of multicategory data relative to the
variance within categories.
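
The minimum mean square error property of intrinsic analysis can be checked
numerically. The following is a sketch under stated assumptions, not the DX-1
implementation: it uses NumPy, synthetic data, and a hypothetical helper
`intrinsic_basis`, and it centers the data on the sample mean, whereas the
formulation used in this report may instead work about the origin.

```python
import numpy as np

def intrinsic_basis(X, k):
    """Return the sample mean, the k leading eigenvectors of the sample
    covariance matrix, and all eigenvalues in descending order."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False, bias=True)  # normalized by n
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order[:k]], eigvals[order]

# Synthetic correlated data: 500 patterns of dimension 6 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))

k = 2
mean, basis, eigvals = intrinsic_basis(X, k)
coeffs = (X - mean) @ basis          # compressed representation, k numbers per pattern
X_hat = mean + coeffs @ basis.T      # reconstruction from the truncated basis

# Mean square reconstruction error per pattern.
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
# Sum of the eigenvalues belonging to the discarded basis vectors.
discarded = eigvals[k:].sum()
```

With this normalization the two quantities `mse` and `discarded` agree to
rounding error, which is the sense in which truncating the intrinsic basis
gives the least mean square error of any k-dimensional linear representation.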

These methods have been implemented on the Experimental Dynamic
Processor (DX-1) at AFCRL, using state-of-the-art algebraic eigensystem
algorithms. A new eigensystem approximation technique has also been
developed which allows approximate intrinsic analysis to be applied to very
high-dimensional problems which would otherwise be intractable due to
computer time and storage requirements. The dimension of data to which
discriminant analysis can be applied is limited by storage requirements and
by the number of samples available. Both of these limitations have been
overcome by first representing the data in a truncated intrinsic basis. This
has also reduced the computation time required for the discriminant analysis.
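
The combined procedure can be sketched as follows. This is an illustration
under stated assumptions rather than the algorithm as implemented: the data
are synthetic, the helper names are hypothetical, the intrinsic basis is
obtained from a singular value decomposition of the centered data, and the
discriminant axes are found by whitening the within-category scatter matrix
and then diagonalizing the between-category scatter matrix.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-category (Sw) and between-category (Sb) scatter matrices."""
    grand_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - grand_mean, mc - grand_mean)
    return Sw, Sb

def discriminant_axes(X, y, n_axes):
    """Axes maximizing between-category relative to within-category scatter.
    Requires Sw to be nonsingular."""
    Sw, Sb = scatter_matrices(X, y)
    w_vals, w_vecs = np.linalg.eigh(Sw)
    whiten = w_vecs / np.sqrt(w_vals)   # scale each eigenvector column
    b_vals, b_vecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(b_vals)[::-1][:n_axes]
    return whiten @ b_vecs[:, order]

# Three categories, 10 samples each, dimension 50: fewer samples than dimensions.
rng = np.random.default_rng(1)
n_per, d, k = 10, 50, 5
means = rng.normal(scale=3.0, size=(3, d))
X = np.vstack([m + rng.normal(size=(n_per, d)) for m in means])
y = np.repeat(np.arange(3), n_per)

# In the original space Sw is singular, so it cannot be inverted or whitened.
Sw_full, _ = scatter_matrices(X, y)
rank_full = np.linalg.matrix_rank(Sw_full)

# Step 1: represent the data in a truncated intrinsic basis (k components).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T

# Step 2: discriminant analysis on the k-dimensional coefficients.
axes = discriminant_axes(Z, y, n_axes=2)
features = Z @ axes
```

After truncation the scatter matrices are only k by k, which is what relieves
both the storage requirement and the dependence on the number of samples.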

In an interactive computer data analysis system, these methods are
valuable in displaying information in a form which elucidates its statistical
characteristics. This can help the system engineer or scientist determine
the structure and degree of complexity of his problem. Since the ultimate
goal of most data analysis systems is to improve a real world decision
process, the applicability of these methods to pattern classification
problems was considered. The minimum risk Bayes strategy for pattern
recognition was reviewed, as well as the concept of discriminant functions
and their equivalent decision surfaces in the pattern space. It was also
noted that when the measurements which make up the patterns are very numerous
and/or highly interrelated, many pattern classification algorithms cannot
be used effectively; it is then necessary to transform the original
measurements into fewer and/or better measurements. This process is called
feature extraction.

To test the applicability of intrinsic analysis and discriminant
analysis to feature extraction, a commonly used, matched filter-type
classifier was chosen: the minimum distance rule, which classifies an unknown
pattern in the category to whose mean it is nearest. (This strategy is
implemented by linear discriminant functions which are negatives of squared
distances from category means, and is Bayes optimal when the categories have
symmetric normal probability densities.) The minimum distance rule was
applied to an aircraft radar frequency signature classification problem of
high dimension with (a) no feature extraction, (b) feature extraction by
principal components analysis, and (c) feature extraction by discriminant
analysis. Principal components analysis greatly reduced the number of
measurements but did not significantly improve the performance of the
classifier. Discriminant analysis reduced the dimension even further and
reduced the error rate substantially. These results are to be expected, since
principal components are vectors which preserve, as well as possible, squared
distances of all patterns from the origin in the pattern space, whereas
discriminant vectors emphasize variance among categories.
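
The sense in which the minimum distance rule is implemented by linear
discriminant functions can be made concrete. Expanding the negative squared
distance to category mean m_i gives -|x|^2 + 2 m_i.x - |m_i|^2; the term
-|x|^2 is the same for every category and can be dropped, leaving a function
linear in x. The sketch below, which uses NumPy, synthetic data, and
hypothetical function names, checks that the two forms give the same
decisions.

```python
import numpy as np

def fit_means(X, y):
    """Category labels and the matrix of category means (one row per category)."""
    labels = np.unique(y)
    return labels, np.stack([X[y == c].mean(axis=0) for c in labels])

def classify_by_distance(X, labels, means):
    """Assign each pattern to the category whose mean is nearest."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return labels[np.argmin(d2, axis=1)]

def classify_by_linear_discriminant(X, labels, means):
    """Same rule, written as g_i(x) = m_i.x - |m_i|^2 / 2 and taking the largest."""
    g = X @ means.T - 0.5 * (means ** 2).sum(axis=1)
    return labels[np.argmax(g, axis=1)]

# Synthetic three-category problem in four dimensions (illustrative only).
rng = np.random.default_rng(2)
true_means = rng.normal(scale=4.0, size=(3, 4))
y = np.repeat(np.arange(3), 100)
X = true_means[y] + rng.normal(size=(300, 4))

labels, means = fit_means(X, y)
by_distance = classify_by_distance(X, labels, means)
by_linear = classify_by_linear_discriminant(X, labels, means)
```

The linear form needs only one inner product per category for each unknown
pattern, which is why the rule is described as a matched filter-type
classifier.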

All of the transformations described above for data reduction and
classification are linear. They performed reasonably well because the
category densities were essentially unimodal. Other experiments have shown
that their performance is much worse when more complex, multimodal densities
are involved. Nonlinear methods are then needed, which can estimate the
densities directly in order to implement optimal decision rules. To this end
most researchers have advocated either histograms or orthogonal expansions.
For pattern dimensions greater than two, the former are too cumbersome,
requiring a great deal of manual supervision, and the latter are hopelessly
complicated. D. Specht has successfully employed a far more promising
approach involving multinomial expansions of multivariate density function
estimates. It appears that future work along these lines should be directed
toward development and refinement of his technique.



Implementation  and  effective  utilization  of  an  interactive  data  analysis
facility  using  the  methods  described  here  has  required  the  design  of  an
operating  system  tailored  to  its  needs.  Its  features  include  a  disk-based 
vector  file  system  for  convenient  manipulation  of  vector  data,  an  on-line 
system  monitor,  a  dynamic,  color  vector  projection  program,  and  a  special 
purpose  programming  language.  The  need  for  such  special  purpose  software 
and  the  high  demands  in  computation  time  of  many  of  the  algorithms  involved 
indicate  that  such  systems  can  best  be  implemented  on  small  or  medium-scale 
dedicated  machines  rather  than  on  large-scale,  time-shared  configurations. 

The  color  CRT  provides  easy  identification  of  projected  points  by 
category,  which  is  valuable  in  classification  problems.  We  have  seen  that 
discriminant  analysis  can  determine  good  projection  planes  for  multicategory 
data.  It  should  be  noted,  however,  that  some  intrinsically  complex  problems 
may  not  be  sufficiently  well  represented  by  any  two  dimensional  projection. 
Therefore,  such  displays  should  be  used  to  augment  the  intuition  but  not  to 
draw  hard  conclusions  about  the  data  structures  involved. 

The implementation of the algorithms has been as modular as possible
to allow the greatest flexibility in their application. The vector
projection program allows intermediate results to be visually evaluated.
Once a useful sequence of operations has been established, it is desirable,
for simplicity of operation, to define it as a single procedure. For this
purpose, future additions to the system will include a cataloged procedure
facility. Also needed is a provision for on-line graphic documentation of
programs and procedures.